Authors: Carlos Iborra Llopis (100451170), Alejandra Galán Arrospide (100451273)
For additional notes and requirements: https://github.com/carlosiborra/Grupo02-Practica1-AprendizajeAutomatico
❗If you are willing to run the code yourself, please clone the full GitHub repository, as it contains the necessary folder structures to export images and results❗
""" Importing necessary libraries """
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno
import seaborn as sns
import scipy.stats as st
import scipy
import sklearn
from matplotlib.cbook import boxplot_stats as bps
This way we avoid creating multiple images and sending the old ones to the trash.
Also using this to upload cleaner commits to GitHub.
""" Cleaning the ../data/img/ folder """
import os
import glob
files = glob.glob("../data/img/*")
for f in files:
if os.path.isfile(f) and f.endswith(".png"):
os.remove(f)
files = glob.glob("../data/img/box-plot/*")
for f in files:
if os.path.isfile(f) and f.endswith(".png"):
os.remove(f)
""" Reading the dataset """
disp_df = pd.read_csv("../data/disp_st2ns1.txt.bz2", compression="bz2", index_col=0)
comp_df = pd.read_csv("../data/comp_st2ns1.txt.bz2", compression="bz2", index_col=0)
Key Concepts of Exploratory Data Analysis
To conduct exploratory data analysis (EDA) on our real data, we need to prepare the data first. Therefore, we have decided to separate the data into training and test sets at an early stage to avoid data leakage, which could result in an overly optimistic evaluation of the model, among other consequences. This separation will be done by dividing the data prematurely into training and test sets since potential data leakage can occur from the usage of the test partition, especially when including the result variable.
It is important to note that this step is necessary because all the information obtained in this section will be used to make decisions such as dimensionality reduction. Furthermore, this approach makes the evaluation more realistic and rigorous since the test set is not used until the end of the process.
""" Train Test Split (time series) """
# * Make a copy of the dataframe (as Padas dataframe is mutable, therefore uses a reference)
disp_df_copy = disp_df.copy()
# print(disp_df)
# print(disp_df_copy)
# Now we make the train_x, train_y, test_x, test_y splits taking into account the time series
# Note: the time series is ordered by date, therefore we need to split the data in a way that the train data is before the test data
# Note: the 10 first years are used for training and the last two years for testing
# Note: this is done because if not, we will be predicting the past from the future, which leads to errors and overfitting (data leakage) in the model
# * Calculate the number of rows for training and testing
num_rows = disp_df_copy.shape[0]
num_train_rows = int(
num_rows * 10 / 12
) # 10 first years for training, 2 last years for testing
# * Split the data into train and test dataframes (using iloc instead of train_test_split as it picks random rows)
train_df = disp_df_copy.iloc[
:num_train_rows, :
] # train contains the first 10 years of rows
test_df = disp_df_copy.iloc[
num_train_rows:, :
] # test contains the last 2 years of rows
# Print the number of rows for each dataframe
print(f"Number of rows for training (EDA): {train_df.shape[0]}")
print(f"Number of rows for testing: {test_df.shape[0]}")
# ! We maintain the original dataframe for later use (as we will divide it into train and test dataframes below)
# ! For the EDA, we will use the train_df dataframe (with the outpout variable).
Number of rows for training (EDA): 3650 Number of rows for testing: 730
# Display all the columns of the dataframe
pd.set_option("display.max_columns", None)
train_df.describe()
| apcp_sf1_1 | apcp_sf2_1 | apcp_sf3_1 | apcp_sf4_1 | apcp_sf5_1 | dlwrf_s1_1 | dlwrf_s2_1 | dlwrf_s3_1 | dlwrf_s4_1 | dlwrf_s5_1 | dswrf_s1_1 | dswrf_s2_1 | dswrf_s3_1 | dswrf_s4_1 | dswrf_s5_1 | pres_ms1_1 | pres_ms2_1 | pres_ms3_1 | pres_ms4_1 | pres_ms5_1 | pwat_ea1_1 | pwat_ea2_1 | pwat_ea3_1 | pwat_ea4_1 | pwat_ea5_1 | spfh_2m1_1 | spfh_2m2_1 | spfh_2m3_1 | spfh_2m4_1 | spfh_2m5_1 | tcdc_ea1_1 | tcdc_ea2_1 | tcdc_ea3_1 | tcdc_ea4_1 | tcdc_ea5_1 | tcolc_e1_1 | tcolc_e2_1 | tcolc_e3_1 | tcolc_e4_1 | tcolc_e5_1 | tmax_2m1_1 | tmax_2m2_1 | tmax_2m3_1 | tmax_2m4_1 | tmax_2m5_1 | tmin_2m1_1 | tmin_2m2_1 | tmin_2m3_1 | tmin_2m4_1 | tmin_2m5_1 | tmp_2m_1_1 | tmp_2m_2_1 | tmp_2m_3_1 | tmp_2m_4_1 | tmp_2m_5_1 | tmp_sfc1_1 | tmp_sfc2_1 | tmp_sfc3_1 | tmp_sfc4_1 | tmp_sfc5_1 | ulwrf_s1_1 | ulwrf_s2_1 | ulwrf_s3_1 | ulwrf_s4_1 | ulwrf_s5_1 | ulwrf_t1_1 | ulwrf_t2_1 | ulwrf_t3_1 | ulwrf_t4_1 | ulwrf_t5_1 | uswrf_s1_1 | uswrf_s2_1 | uswrf_s3_1 | uswrf_s4_1 | uswrf_s5_1 | salida | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3.650000e+03 |
| mean | 0.610222 | 0.251049 | 0.479367 | 0.279969 | 0.525625 | 316.590458 | 316.996492 | 324.225574 | 343.169304 | 342.582550 | 0.074371 | 163.928966 | 376.718929 | 686.534869 | 508.429988 | 101718.580471 | 101774.517076 | 101743.013770 | 101538.253073 | 101499.397514 | 21.394485 | 21.536129 | 22.127195 | 22.595594 | 22.384870 | 0.007844 | 0.008848 | 0.009356 | 0.009473 | 0.009918 | 0.069240 | 0.067845 | 0.064862 | 0.065706 | 0.062366 | 0.069539 | 0.068172 | 0.065166 | 0.066036 | 0.062748 | 286.950030 | 288.292227 | 292.803749 | 294.483694 | 294.542492 | 284.595935 | 284.638684 | 284.617400 | 292.733513 | 291.084714 | 284.846286 | 288.227387 | 292.740802 | 294.299550 | 291.301035 | 284.094056 | 289.230769 | 295.533258 | 295.904819 | 290.366407 | 375.991521 | 381.989673 | 400.742449 | 439.104661 | 431.318749 | 247.736467 | 247.626828 | 251.950057 | 262.207928 | 261.074238 | 0.078107 | 38.716712 | 76.394795 | 127.098207 | 99.476613 | 1.638200e+07 |
| std | 2.245850 | 0.994112 | 1.756408 | 1.120933 | 1.931408 | 56.119896 | 58.129352 | 58.941747 | 61.150202 | 61.027007 | 0.305126 | 112.645372 | 159.486316 | 227.642854 | 193.753483 | 725.206610 | 731.500969 | 720.701217 | 699.477989 | 715.361146 | 12.256253 | 12.358856 | 12.583364 | 12.633154 | 12.401121 | 0.004398 | 0.005039 | 0.005175 | 0.005097 | 0.005456 | 0.167104 | 0.169653 | 0.171287 | 0.172516 | 0.166113 | 0.166989 | 0.169522 | 0.171172 | 0.172385 | 0.165958 | 8.925065 | 9.743169 | 9.898253 | 9.789117 | 9.776615 | 8.735982 | 8.862301 | 8.866503 | 9.950300 | 10.099684 | 8.722593 | 9.795209 | 9.944761 | 9.795537 | 10.083859 | 8.861650 | 9.756852 | 9.148308 | 9.317363 | 10.462108 | 46.586515 | 49.914820 | 50.766618 | 53.159310 | 54.417631 | 36.270918 | 36.289003 | 35.798277 | 38.698726 | 38.427066 | 0.258752 | 26.010130 | 30.743175 | 40.765618 | 35.505727 | 8.059674e+06 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 158.971770 | 160.032903 | 165.524543 | 183.671312 | 186.342961 | 0.000000 | 0.000000 | 20.000000 | 30.000000 | 20.000000 | 99316.970881 | 99315.887074 | 99327.755682 | 99040.100852 | 98830.153409 | 1.100000 | 1.314819 | 1.107352 | 1.142803 | 1.201246 | 0.000462 | 0.000485 | 0.000451 | 0.000478 | 0.000468 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 254.589220 | 254.937418 | 258.549777 | 260.800365 | 260.863475 | 251.941358 | 249.576132 | 249.576714 | 258.698331 | 258.171345 | 251.942065 | 254.844406 | 258.552646 | 260.795430 | 258.170049 | 250.100794 | 256.360800 | 263.634377 | 264.533564 | 256.520408 | 229.296161 | 223.985486 | 246.314349 | 278.576630 | 271.707606 | 104.671267 | 113.559602 | 118.679132 | 119.393449 | 121.951425 | 0.000000 | 0.000000 | 3.181818 | 4.363636 | 2.545455 | 5.100000e+05 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 270.043573 | 267.583016 | 275.008281 | 292.786299 | 291.777096 | 0.000000 | 52.727273 | 240.000000 | 525.454545 | 344.477273 | 101266.472124 | 101311.399680 | 101283.033381 | 101102.175426 | 101049.033203 | 10.879000 | 10.718024 | 11.122964 | 11.558385 | 11.559638 | 0.003991 | 0.004229 | 0.004617 | 0.004736 | 0.004754 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000618 | 0.000564 | 0.000545 | 0.000673 | 0.000727 | 280.127903 | 280.413352 | 284.916120 | 286.996939 | 287.043590 | 277.844468 | 277.666665 | 277.653641 | 284.828555 | 283.228638 | 278.075898 | 280.285865 | 284.764714 | 286.841753 | 283.495367 | 277.025516 | 281.298508 | 288.689326 | 288.982901 | 281.922856 | 338.180517 | 340.208757 | 358.706337 | 398.061333 | 388.214025 | 230.536257 | 230.759227 | 234.398558 | 246.594118 | 244.431134 | 0.000000 | 14.000000 | 53.818182 | 108.818182 | 74.909091 | 1.061385e+07 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 319.801794 | 321.400251 | 328.456741 | 345.402277 | 345.107513 | 0.000000 | 150.000000 | 384.318182 | 730.000000 | 525.636364 | 101645.975852 | 101704.351207 | 101674.750000 | 101472.419389 | 101425.127131 | 19.191209 | 19.163636 | 19.650000 | 20.290909 | 20.194252 | 0.007246 | 0.008248 | 0.008909 | 0.009156 | 0.009518 | 0.004545 | 0.004545 | 0.003636 | 0.003636 | 0.002727 | 0.005118 | 0.004855 | 0.004145 | 0.004345 | 0.003673 | 287.597683 | 289.039583 | 293.757317 | 295.529044 | 295.598358 | 285.070623 | 285.192374 | 285.106049 | 293.606805 | 291.806785 | 285.381022 | 288.990954 | 293.704959 | 295.310534 | 292.050580 | 284.581711 | 290.050444 | 296.204518 | 296.669453 | 291.126846 | 376.267101 | 382.791372 | 401.524648 | 440.987373 | 433.520339 | 253.350231 | 253.394166 | 257.342928 | 270.790095 | 269.287814 | 0.000000 | 35.500000 | 79.636364 | 136.636364 | 105.454545 | 1.638195e+07 |
| 75% | 0.114545 | 0.051818 | 0.121591 | 0.033636 | 0.090000 | 367.134144 | 370.342597 | 378.683015 | 399.545104 | 398.891589 | 0.000000 | 264.454545 | 524.636364 | 893.636364 | 693.681818 | 102131.380504 | 102188.719283 | 102148.666726 | 101940.005327 | 101919.873402 | 31.188882 | 31.471632 | 32.439831 | 33.103788 | 32.459132 | 0.011612 | 0.013523 | 0.014275 | 0.014169 | 0.015062 | 0.055455 | 0.056364 | 0.042727 | 0.043636 | 0.038182 | 0.056136 | 0.056918 | 0.042900 | 0.043523 | 0.038843 | 294.329548 | 296.940075 | 301.377123 | 302.732298 | 302.753016 | 292.290495 | 292.587547 | 292.576612 | 301.351051 | 299.909759 | 292.551771 | 296.945783 | 301.346064 | 302.621007 | 300.068969 | 292.110791 | 297.882618 | 303.161016 | 303.555412 | 299.639197 | 416.508387 | 427.792698 | 445.458675 | 482.899051 | 476.327880 | 274.309069 | 274.750930 | 278.752231 | 289.945510 | 289.588822 | 0.000000 | 62.000000 | 103.068182 | 155.454545 | 129.727273 | 2.329185e+07 |
| max | 34.428182 | 16.846364 | 28.399091 | 26.381818 | 36.875455 | 426.173970 | 427.486894 | 429.693146 | 455.566337 | 453.910406 | 3.000000 | 381.818182 | 642.181818 | 990.000000 | 791.090909 | 104688.396307 | 104856.285511 | 104693.185369 | 104244.932528 | 104249.968040 | 60.327273 | 58.876881 | 59.915362 | 59.309182 | 60.529133 | 0.018809 | 0.019533 | 0.020985 | 0.021932 | 0.023318 | 1.920909 | 2.370000 | 2.449091 | 2.146364 | 1.957273 | 1.920136 | 2.369282 | 2.450482 | 2.146409 | 1.956655 | 304.480122 | 304.792880 | 311.277519 | 312.660564 | 312.668726 | 300.350930 | 299.724509 | 299.735546 | 310.815957 | 308.761763 | 300.344230 | 304.773410 | 311.272270 | 312.595520 | 308.827304 | 299.869093 | 306.834309 | 315.964081 | 313.965757 | 308.270147 | 470.753102 | 469.429213 | 504.584351 | 555.704024 | 542.529280 | 318.245345 | 311.991660 | 315.569164 | 328.920274 | 327.253141 | 1.000000 | 92.272727 | 192.636364 | 450.636364 | 313.909091 | 3.122700e+07 |
train_df.shape
(3650, 76)
train_df.head()
| apcp_sf1_1 | apcp_sf2_1 | apcp_sf3_1 | apcp_sf4_1 | apcp_sf5_1 | dlwrf_s1_1 | dlwrf_s2_1 | dlwrf_s3_1 | dlwrf_s4_1 | dlwrf_s5_1 | dswrf_s1_1 | dswrf_s2_1 | dswrf_s3_1 | dswrf_s4_1 | dswrf_s5_1 | pres_ms1_1 | pres_ms2_1 | pres_ms3_1 | pres_ms4_1 | pres_ms5_1 | pwat_ea1_1 | pwat_ea2_1 | pwat_ea3_1 | pwat_ea4_1 | pwat_ea5_1 | spfh_2m1_1 | spfh_2m2_1 | spfh_2m3_1 | spfh_2m4_1 | spfh_2m5_1 | tcdc_ea1_1 | tcdc_ea2_1 | tcdc_ea3_1 | tcdc_ea4_1 | tcdc_ea5_1 | tcolc_e1_1 | tcolc_e2_1 | tcolc_e3_1 | tcolc_e4_1 | tcolc_e5_1 | tmax_2m1_1 | tmax_2m2_1 | tmax_2m3_1 | tmax_2m4_1 | tmax_2m5_1 | tmin_2m1_1 | tmin_2m2_1 | tmin_2m3_1 | tmin_2m4_1 | tmin_2m5_1 | tmp_2m_1_1 | tmp_2m_2_1 | tmp_2m_3_1 | tmp_2m_4_1 | tmp_2m_5_1 | tmp_sfc1_1 | tmp_sfc2_1 | tmp_sfc3_1 | tmp_sfc4_1 | tmp_sfc5_1 | ulwrf_s1_1 | ulwrf_s2_1 | ulwrf_s3_1 | ulwrf_s4_1 | ulwrf_s5_1 | ulwrf_t1_1 | ulwrf_t2_1 | ulwrf_t3_1 | ulwrf_t4_1 | ulwrf_t5_1 | uswrf_s1_1 | uswrf_s2_1 | uswrf_s3_1 | uswrf_s4_1 | uswrf_s5_1 | salida | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| V1 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 268.583582 | 244.241641 | 251.174486 | 269.741308 | 268.377441 | 0.0 | 30.0 | 220.000000 | 510.000000 | 330.000000 | 101832.056108 | 102053.159091 | 102090.046165 | 101934.175426 | 101988.003551 | 5.879193 | 7.018182 | 8.460800 | 9.418182 | 9.727869 | 0.003229 | 0.002993 | 0.003775 | 0.003870 | 0.003855 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000909 | 0.000818 | 0.000264 | 0.000255 | 0.000500 | 0.002218 | 280.789784 | 279.627444 | 285.727761 | 286.881681 | 286.885823 | 279.198020 | 278.472615 | 278.474720 | 285.799685 | 280.966961 | 279.249256 | 279.612202 | 285.742784 | 286.841053 | 280.960865 | 277.278370 | 279.250383 | 288.826760 | 288.596086 | 278.500078 | 341.122231 | 335.067918 | 354.626126 | 397.774053 | 383.281225 | 222.153166 | 252.504475 | 254.760271 | 263.342404 | 260.067843 | 0.0 | 10.000000 | 50.000000 | 106.636364 | 72.000000 | 11930700 |
| V2 | 0.0 | 0.0 | 0.0 | 0.008182 | 0.2 | 251.725869 | 255.824126 | 272.163913 | 318.259924 | 307.929083 | 0.0 | 30.0 | 173.636364 | 333.636364 | 224.545455 | 101425.883523 | 101284.509233 | 101253.654830 | 100999.313920 | 101424.626420 | 12.534339 | 11.987316 | 12.159355 | 12.313590 | 13.469729 | 0.003737 | 0.003931 | 0.004015 | 0.003994 | 0.004826 | 0.037273 | 0.021818 | 0.101818 | 0.084545 | 0.109091 | 0.037155 | 0.021309 | 0.102373 | 0.085827 | 0.109336 | 278.822329 | 278.063379 | 283.618583 | 286.606684 | 286.643397 | 277.258919 | 276.740628 | 276.740628 | 283.687009 | 282.111078 | 277.282621 | 278.070390 | 283.604600 | 286.554729 | 282.105011 | 275.830009 | 278.269459 | 287.048970 | 287.325478 | 281.005252 | 330.159915 | 329.354673 | 347.524819 | 388.017767 | 378.773804 | 236.836691 | 233.458263 | 233.027276 | 212.652054 | 222.052916 | 0.0 | 8.181818 | 35.909091 | 58.181818 | 42.090909 | 9778500 |
| V3 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 219.734547 | 211.996022 | 216.405820 | 235.529123 | 239.840132 | 0.0 | 30.0 | 220.000000 | 523.636364 | 337.545455 | 102253.654119 | 102301.918324 | 102088.093750 | 101652.815341 | 101543.146307 | 5.726770 | 5.458528 | 5.700000 | 7.163636 | 9.536364 | 0.002003 | 0.001919 | 0.002107 | 0.002431 | 0.002583 | 0.000000 | 0.000000 | 0.007273 | 0.007273 | 0.042727 | 0.001427 | 0.001582 | 0.007309 | 0.006973 | 0.042127 | 275.400091 | 270.222512 | 275.885787 | 279.049513 | 279.381653 | 269.756037 | 269.157731 | 269.156439 | 276.041792 | 275.301960 | 269.766876 | 270.204285 | 275.880818 | 279.064603 | 275.806757 | 269.533059 | 271.690993 | 281.759993 | 282.686446 | 273.615503 | 309.639845 | 299.751961 | 317.250763 | 364.339136 | 351.496665 | 238.655654 | 232.828737 | 235.480750 | 245.177331 | 238.893102 | 0.0 | 10.272727 | 55.272727 | 118.454545 | 79.181818 | 9771900 |
| V4 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 253.499410 | 230.896544 | 235.857221 | 240.274556 | 237.804048 | 0.0 | 30.0 | 208.181818 | 512.727273 | 337.181818 | 102110.375710 | 102435.603693 | 102688.528409 | 102588.876420 | 102598.252841 | 7.889904 | 6.768959 | 6.208357 | 5.977267 | 6.411838 | 0.002918 | 0.002735 | 0.002771 | 0.002821 | 0.002738 | 0.000000 | 0.002727 | 0.005455 | 0.000909 | 0.012727 | 0.000473 | 0.004018 | 0.007300 | 0.001600 | 0.014882 | 279.396046 | 276.176919 | 276.868630 | 278.550368 | 278.572038 | 276.175482 | 273.839142 | 273.840535 | 276.942990 | 273.802970 | 276.312428 | 274.045715 | 276.877749 | 278.571555 | 273.812827 | 274.824765 | 274.466433 | 281.291418 | 281.871679 | 272.191753 | 330.310971 | 318.761563 | 329.305478 | 360.297788 | 348.618319 | 236.784869 | 241.916776 | 243.398572 | 251.473036 | 247.503769 | 0.0 | 8.909091 | 46.000000 | 107.090909 | 73.636364 | 6466800 |
| V5 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 234.890020 | 238.927051 | 246.850822 | 271.577246 | 275.572826 | 0.0 | 30.0 | 220.000000 | 517.272727 | 336.363636 | 101750.317472 | 101331.333807 | 100921.029119 | 100422.514915 | 100309.059659 | 10.783448 | 10.425542 | 10.362327 | 8.829511 | 9.647615 | 0.003274 | 0.003269 | 0.003066 | 0.003483 | 0.003788 | 0.000909 | 0.000909 | 0.000909 | 0.014545 | 0.050909 | 0.001673 | 0.001836 | 0.001373 | 0.015909 | 0.049591 | 273.294803 | 275.018022 | 283.542744 | 288.171156 | 288.265137 | 272.858415 | 273.303902 | 273.306355 | 283.734819 | 283.735446 | 273.314844 | 274.990234 | 283.563099 | 288.178922 | 285.567946 | 272.260426 | 275.132668 | 285.698725 | 288.490562 | 283.121391 | 310.023179 | 314.763264 | 334.042186 | 388.737835 | 383.409776 | 233.641681 | 233.706659 | 239.952805 | 258.128188 | 253.200684 | 0.0 | 8.909091 | 48.909091 | 106.272727 | 71.818182 | 11545200 |
train_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 3650 entries, V1 to V3650 Data columns (total 76 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 apcp_sf1_1 3650 non-null float64 1 apcp_sf2_1 3650 non-null float64 2 apcp_sf3_1 3650 non-null float64 3 apcp_sf4_1 3650 non-null float64 4 apcp_sf5_1 3650 non-null float64 5 dlwrf_s1_1 3650 non-null float64 6 dlwrf_s2_1 3650 non-null float64 7 dlwrf_s3_1 3650 non-null float64 8 dlwrf_s4_1 3650 non-null float64 9 dlwrf_s5_1 3650 non-null float64 10 dswrf_s1_1 3650 non-null float64 11 dswrf_s2_1 3650 non-null float64 12 dswrf_s3_1 3650 non-null float64 13 dswrf_s4_1 3650 non-null float64 14 dswrf_s5_1 3650 non-null float64 15 pres_ms1_1 3650 non-null float64 16 pres_ms2_1 3650 non-null float64 17 pres_ms3_1 3650 non-null float64 18 pres_ms4_1 3650 non-null float64 19 pres_ms5_1 3650 non-null float64 20 pwat_ea1_1 3650 non-null float64 21 pwat_ea2_1 3650 non-null float64 22 pwat_ea3_1 3650 non-null float64 23 pwat_ea4_1 3650 non-null float64 24 pwat_ea5_1 3650 non-null float64 25 spfh_2m1_1 3650 non-null float64 26 spfh_2m2_1 3650 non-null float64 27 spfh_2m3_1 3650 non-null float64 28 spfh_2m4_1 3650 non-null float64 29 spfh_2m5_1 3650 non-null float64 30 tcdc_ea1_1 3650 non-null float64 31 tcdc_ea2_1 3650 non-null float64 32 tcdc_ea3_1 3650 non-null float64 33 tcdc_ea4_1 3650 non-null float64 34 tcdc_ea5_1 3650 non-null float64 35 tcolc_e1_1 3650 non-null float64 36 tcolc_e2_1 3650 non-null float64 37 tcolc_e3_1 3650 non-null float64 38 tcolc_e4_1 3650 non-null float64 39 tcolc_e5_1 3650 non-null float64 40 tmax_2m1_1 3650 non-null float64 41 tmax_2m2_1 3650 non-null float64 42 tmax_2m3_1 3650 non-null float64 43 tmax_2m4_1 3650 non-null float64 44 tmax_2m5_1 3650 non-null float64 45 tmin_2m1_1 3650 non-null float64 46 tmin_2m2_1 3650 non-null float64 47 tmin_2m3_1 3650 non-null float64 48 tmin_2m4_1 3650 non-null float64 49 tmin_2m5_1 3650 non-null float64 50 tmp_2m_1_1 3650 non-null float64 51 tmp_2m_2_1 3650 non-null float64 52 tmp_2m_3_1 3650 non-null float64 53 tmp_2m_4_1 3650 non-null float64 54 tmp_2m_5_1 3650 non-null float64 55 tmp_sfc1_1 3650 non-null float64 56 tmp_sfc2_1 3650 non-null float64 57 tmp_sfc3_1 3650 non-null float64 58 tmp_sfc4_1 3650 non-null float64 59 tmp_sfc5_1 3650 non-null float64 60 ulwrf_s1_1 3650 non-null float64 61 ulwrf_s2_1 3650 non-null float64 62 ulwrf_s3_1 3650 non-null float64 63 ulwrf_s4_1 3650 non-null float64 64 ulwrf_s5_1 3650 non-null float64 65 ulwrf_t1_1 3650 non-null float64 66 ulwrf_t2_1 3650 non-null float64 67 ulwrf_t3_1 3650 non-null float64 68 ulwrf_t4_1 3650 non-null float64 69 ulwrf_t5_1 3650 non-null float64 70 uswrf_s1_1 3650 non-null float64 71 uswrf_s2_1 3650 non-null float64 72 uswrf_s3_1 3650 non-null float64 73 uswrf_s4_1 3650 non-null float64 74 uswrf_s5_1 3650 non-null float64 75 salida 3650 non-null int64 dtypes: float64(75), int64(1) memory usage: 2.1+ MB
Fist, we check the number the total number of missing values in the dataset in order to know if we have to clean the dataset or not.
train_df.isna().sum()
apcp_sf1_1 0
apcp_sf2_1 0
apcp_sf3_1 0
apcp_sf4_1 0
apcp_sf5_1 0
..
uswrf_s2_1 0
uswrf_s3_1 0
uswrf_s4_1 0
uswrf_s5_1 0
salida 0
Length: 76, dtype: int64
As we can oberve, there are no missing values in the dataset, but theres still the possibility of having missing values measured as 0's, so we will check if all those zeros make sense in the context of the dataset or not.
# In the plot, we can see that there are a lot of 0 values in the dataset
train_df.plot(legend=False, figsize=(15, 5))
<Axes: >
result = train_df.eq(0.0).sum() / len(train_df) * 100
# Select those columns with more than 30% of zeros
result = result[result > 30.0]
result = result.sort_values(ascending=False)
result
dswrf_s1_1 91.808219 uswrf_s1_1 90.767123 apcp_sf4_1 63.041096 apcp_sf5_1 61.041096 apcp_sf1_1 60.821918 apcp_sf2_1 59.890411 apcp_sf3_1 56.739726 tcdc_ea3_1 37.917808 tcdc_ea1_1 37.808219 tcdc_ea2_1 37.424658 tcdc_ea5_1 36.301370 tcdc_ea4_1 35.726027 dtype: float64
As output of the previous cell, we can see that there exist a lot of zeros in the dataset, let's analize if those zeros make sense or not.
The variables with most ammount of zeros (>30%) are:
First, lets start by assigning the zeros to NaNs. By doing this we can visualize the varibles that take more values other than zero.
disp_df_nan = train_df.replace(0.0, np.nan)
""" Plotting missing values """
# Sustitute 0.0 values with NaN and plot the name of the columns with missing values
# ? msno.bar is a simple visualization of nullity by column
msno.bar(disp_df_nan, labels=True, fontsize=7, figsize=(15, 7))
# Exporting image as png to ../data/img folder
plt.savefig("../data/img/missing_values_bar.png")
""" Plotting the missing values in a matrix """
# ? The msno.matrix nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.
msno.matrix(disp_df_nan)
# Exporting image as png to ../data/img folder
plt.savefig("../data/img/missing_values_matrix.png")
""" Plotting the missing values in a heatmap """
# As in a hetmap not every value is shown, we must delimit the values to the ones with more than 30% of missing values
result = disp_df.eq(0.0).sum() / len(disp_df) * 100
result = result[result > 30.0] # Select those columns with more than 30% of zeros
result = result.sort_values(ascending=False)
result = result.index.tolist() # Convert to list
result
# ? The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another
msno.heatmap(disp_df_nan[result], fontsize=7, figsize=(15, 7))
# Exporting image as png to ../data/img folder
plt.savefig("../data/img/missing_values_heatmap.png")
""" Plotting the dendrogram """
# ? The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap:
msno.dendrogram(disp_df_nan, orientation="top", fontsize=7, figsize=(15, 7))
# Exporting image as png to ../data/img folder
plt.savefig("../data/img/missing_values_dendrogram.png")
In this section, we have observe that there are no attibutes with 'Null' nor 'NaN' nor 'None' values. This indicated that at a first glance, the data is clean, at least of those datatypes.
In second place, we have observed that the attributes that we suspected could have an important number of missing values (represented by 0 instead of the previously mentioned), had instead valuable information, as we have proved along this section.
Since the data is clean and we have concluded there are no missing values, we do not need to complete them using a model or other methods, so we can move on to the next step, observing the outliers.
Detecting outliers in a dataset before training a model is crucial because outliers can significantly affect the performance and accuracy of the model. Outliers are data points that deviate significantly from the rest of the dataset and can cause the model to learn incorrect patterns and relationships. When outliers are present in the data, they can also increase the variance of the model, which can result in overfitting. Overfitting occurs when the model fits too closely to the training data, leading to poor generalization to new data. Therefore, it is important to detect and handle outliers properly to ensure the model's accuracy and robustness.
list_of_attributes = train_df.columns.values.tolist()
#print(list_of_attributes)
# Boxplot with all attributes in the dataset
# sns.boxplot(data=train_df, orient="h")
# plt.show()
train_df.describe()
| apcp_sf1_1 | apcp_sf2_1 | apcp_sf3_1 | apcp_sf4_1 | apcp_sf5_1 | dlwrf_s1_1 | dlwrf_s2_1 | dlwrf_s3_1 | dlwrf_s4_1 | dlwrf_s5_1 | dswrf_s1_1 | dswrf_s2_1 | dswrf_s3_1 | dswrf_s4_1 | dswrf_s5_1 | pres_ms1_1 | pres_ms2_1 | pres_ms3_1 | pres_ms4_1 | pres_ms5_1 | pwat_ea1_1 | pwat_ea2_1 | pwat_ea3_1 | pwat_ea4_1 | pwat_ea5_1 | spfh_2m1_1 | spfh_2m2_1 | spfh_2m3_1 | spfh_2m4_1 | spfh_2m5_1 | tcdc_ea1_1 | tcdc_ea2_1 | tcdc_ea3_1 | tcdc_ea4_1 | tcdc_ea5_1 | tcolc_e1_1 | tcolc_e2_1 | tcolc_e3_1 | tcolc_e4_1 | tcolc_e5_1 | tmax_2m1_1 | tmax_2m2_1 | tmax_2m3_1 | tmax_2m4_1 | tmax_2m5_1 | tmin_2m1_1 | tmin_2m2_1 | tmin_2m3_1 | tmin_2m4_1 | tmin_2m5_1 | tmp_2m_1_1 | tmp_2m_2_1 | tmp_2m_3_1 | tmp_2m_4_1 | tmp_2m_5_1 | tmp_sfc1_1 | tmp_sfc2_1 | tmp_sfc3_1 | tmp_sfc4_1 | tmp_sfc5_1 | ulwrf_s1_1 | ulwrf_s2_1 | ulwrf_s3_1 | ulwrf_s4_1 | ulwrf_s5_1 | ulwrf_t1_1 | ulwrf_t2_1 | ulwrf_t3_1 | ulwrf_t4_1 | ulwrf_t5_1 | uswrf_s1_1 | uswrf_s2_1 | uswrf_s3_1 | uswrf_s4_1 | uswrf_s5_1 | salida | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3650.000000 | 3.650000e+03 |
| mean | 0.610222 | 0.251049 | 0.479367 | 0.279969 | 0.525625 | 316.590458 | 316.996492 | 324.225574 | 343.169304 | 342.582550 | 0.074371 | 163.928966 | 376.718929 | 686.534869 | 508.429988 | 101718.580471 | 101774.517076 | 101743.013770 | 101538.253073 | 101499.397514 | 21.394485 | 21.536129 | 22.127195 | 22.595594 | 22.384870 | 0.007844 | 0.008848 | 0.009356 | 0.009473 | 0.009918 | 0.069240 | 0.067845 | 0.064862 | 0.065706 | 0.062366 | 0.069539 | 0.068172 | 0.065166 | 0.066036 | 0.062748 | 286.950030 | 288.292227 | 292.803749 | 294.483694 | 294.542492 | 284.595935 | 284.638684 | 284.617400 | 292.733513 | 291.084714 | 284.846286 | 288.227387 | 292.740802 | 294.299550 | 291.301035 | 284.094056 | 289.230769 | 295.533258 | 295.904819 | 290.366407 | 375.991521 | 381.989673 | 400.742449 | 439.104661 | 431.318749 | 247.736467 | 247.626828 | 251.950057 | 262.207928 | 261.074238 | 0.078107 | 38.716712 | 76.394795 | 127.098207 | 99.476613 | 1.638200e+07 |
| std | 2.245850 | 0.994112 | 1.756408 | 1.120933 | 1.931408 | 56.119896 | 58.129352 | 58.941747 | 61.150202 | 61.027007 | 0.305126 | 112.645372 | 159.486316 | 227.642854 | 193.753483 | 725.206610 | 731.500969 | 720.701217 | 699.477989 | 715.361146 | 12.256253 | 12.358856 | 12.583364 | 12.633154 | 12.401121 | 0.004398 | 0.005039 | 0.005175 | 0.005097 | 0.005456 | 0.167104 | 0.169653 | 0.171287 | 0.172516 | 0.166113 | 0.166989 | 0.169522 | 0.171172 | 0.172385 | 0.165958 | 8.925065 | 9.743169 | 9.898253 | 9.789117 | 9.776615 | 8.735982 | 8.862301 | 8.866503 | 9.950300 | 10.099684 | 8.722593 | 9.795209 | 9.944761 | 9.795537 | 10.083859 | 8.861650 | 9.756852 | 9.148308 | 9.317363 | 10.462108 | 46.586515 | 49.914820 | 50.766618 | 53.159310 | 54.417631 | 36.270918 | 36.289003 | 35.798277 | 38.698726 | 38.427066 | 0.258752 | 26.010130 | 30.743175 | 40.765618 | 35.505727 | 8.059674e+06 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 158.971770 | 160.032903 | 165.524543 | 183.671312 | 186.342961 | 0.000000 | 0.000000 | 20.000000 | 30.000000 | 20.000000 | 99316.970881 | 99315.887074 | 99327.755682 | 99040.100852 | 98830.153409 | 1.100000 | 1.314819 | 1.107352 | 1.142803 | 1.201246 | 0.000462 | 0.000485 | 0.000451 | 0.000478 | 0.000468 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 254.589220 | 254.937418 | 258.549777 | 260.800365 | 260.863475 | 251.941358 | 249.576132 | 249.576714 | 258.698331 | 258.171345 | 251.942065 | 254.844406 | 258.552646 | 260.795430 | 258.170049 | 250.100794 | 256.360800 | 263.634377 | 264.533564 | 256.520408 | 229.296161 | 223.985486 | 246.314349 | 278.576630 | 271.707606 | 104.671267 | 113.559602 | 118.679132 | 119.393449 | 121.951425 | 0.000000 | 0.000000 | 3.181818 | 4.363636 | 2.545455 | 5.100000e+05 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 270.043573 | 267.583016 | 275.008281 | 292.786299 | 291.777096 | 0.000000 | 52.727273 | 240.000000 | 525.454545 | 344.477273 | 101266.472124 | 101311.399680 | 101283.033381 | 101102.175426 | 101049.033203 | 10.879000 | 10.718024 | 11.122964 | 11.558385 | 11.559638 | 0.003991 | 0.004229 | 0.004617 | 0.004736 | 0.004754 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000618 | 0.000564 | 0.000545 | 0.000673 | 0.000727 | 280.127903 | 280.413352 | 284.916120 | 286.996939 | 287.043590 | 277.844468 | 277.666665 | 277.653641 | 284.828555 | 283.228638 | 278.075898 | 280.285865 | 284.764714 | 286.841753 | 283.495367 | 277.025516 | 281.298508 | 288.689326 | 288.982901 | 281.922856 | 338.180517 | 340.208757 | 358.706337 | 398.061333 | 388.214025 | 230.536257 | 230.759227 | 234.398558 | 246.594118 | 244.431134 | 0.000000 | 14.000000 | 53.818182 | 108.818182 | 74.909091 | 1.061385e+07 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 319.801794 | 321.400251 | 328.456741 | 345.402277 | 345.107513 | 0.000000 | 150.000000 | 384.318182 | 730.000000 | 525.636364 | 101645.975852 | 101704.351207 | 101674.750000 | 101472.419389 | 101425.127131 | 19.191209 | 19.163636 | 19.650000 | 20.290909 | 20.194252 | 0.007246 | 0.008248 | 0.008909 | 0.009156 | 0.009518 | 0.004545 | 0.004545 | 0.003636 | 0.003636 | 0.002727 | 0.005118 | 0.004855 | 0.004145 | 0.004345 | 0.003673 | 287.597683 | 289.039583 | 293.757317 | 295.529044 | 295.598358 | 285.070623 | 285.192374 | 285.106049 | 293.606805 | 291.806785 | 285.381022 | 288.990954 | 293.704959 | 295.310534 | 292.050580 | 284.581711 | 290.050444 | 296.204518 | 296.669453 | 291.126846 | 376.267101 | 382.791372 | 401.524648 | 440.987373 | 433.520339 | 253.350231 | 253.394166 | 257.342928 | 270.790095 | 269.287814 | 0.000000 | 35.500000 | 79.636364 | 136.636364 | 105.454545 | 1.638195e+07 |
| 75% | 0.114545 | 0.051818 | 0.121591 | 0.033636 | 0.090000 | 367.134144 | 370.342597 | 378.683015 | 399.545104 | 398.891589 | 0.000000 | 264.454545 | 524.636364 | 893.636364 | 693.681818 | 102131.380504 | 102188.719283 | 102148.666726 | 101940.005327 | 101919.873402 | 31.188882 | 31.471632 | 32.439831 | 33.103788 | 32.459132 | 0.011612 | 0.013523 | 0.014275 | 0.014169 | 0.015062 | 0.055455 | 0.056364 | 0.042727 | 0.043636 | 0.038182 | 0.056136 | 0.056918 | 0.042900 | 0.043523 | 0.038843 | 294.329548 | 296.940075 | 301.377123 | 302.732298 | 302.753016 | 292.290495 | 292.587547 | 292.576612 | 301.351051 | 299.909759 | 292.551771 | 296.945783 | 301.346064 | 302.621007 | 300.068969 | 292.110791 | 297.882618 | 303.161016 | 303.555412 | 299.639197 | 416.508387 | 427.792698 | 445.458675 | 482.899051 | 476.327880 | 274.309069 | 274.750930 | 278.752231 | 289.945510 | 289.588822 | 0.000000 | 62.000000 | 103.068182 | 155.454545 | 129.727273 | 2.329185e+07 |
| max | 34.428182 | 16.846364 | 28.399091 | 26.381818 | 36.875455 | 426.173970 | 427.486894 | 429.693146 | 455.566337 | 453.910406 | 3.000000 | 381.818182 | 642.181818 | 990.000000 | 791.090909 | 104688.396307 | 104856.285511 | 104693.185369 | 104244.932528 | 104249.968040 | 60.327273 | 58.876881 | 59.915362 | 59.309182 | 60.529133 | 0.018809 | 0.019533 | 0.020985 | 0.021932 | 0.023318 | 1.920909 | 2.370000 | 2.449091 | 2.146364 | 1.957273 | 1.920136 | 2.369282 | 2.450482 | 2.146409 | 1.956655 | 304.480122 | 304.792880 | 311.277519 | 312.660564 | 312.668726 | 300.350930 | 299.724509 | 299.735546 | 310.815957 | 308.761763 | 300.344230 | 304.773410 | 311.272270 | 312.595520 | 308.827304 | 299.869093 | 306.834309 | 315.964081 | 313.965757 | 308.270147 | 470.753102 | 469.429213 | 504.584351 | 555.704024 | 542.529280 | 318.245345 | 311.991660 | 315.569164 | 328.920274 | 327.253141 | 1.000000 | 92.272727 | 192.636364 | 450.636364 | 313.909091 | 3.122700e+07 |
train_df['apcp_sf1_1'].value_counts()
0.000000 2220
0.000909 54
0.001818 24
0.003636 19
0.002727 19
...
2.356364 1
0.920000 1
0.048182 1
0.211818 1
1.363636 1
Name: apcp_sf1_1, Length: 1170, dtype: int64
Here, by plotting the boxplots and making the outliers (fliers) visible, we are able to see some outliers in the dataset.
Take into account that the outliers are represented by the points outside the boxplot and they can be potentially wrong values or just values that are not usual in the dataset (ruido).
""" Histogram showing the distribtuion of train_df to show the outliers """
plt.hist(train_df)
plt.show()
Here, as in the boxplot, we can see the outliers in the dataset as well as observing the right skewness of the data as we will later see more clearly in the distribution plots.
With the objective of noticing the outliers on each attribute, we create a box-plot of each of the attributes
""" Plotting the boxplot for each attribute and getting the outliers of each attribute """
total_outliers = []
# * We iterate over the list of attributes
for attribute in list_of_attributes:
# * sns.regplot(x=train_df[attribute], y=train_df['total'], fit_reg=False)
sns.boxplot(data=train_df[attribute], x=train_df[attribute], orient="h")
# * Use the command below to show each plot (small size for visualization sake)
# sns.set(rc={'figure.figsize':(1,.5)})
# plt.show()
# * All the images are saved in the folder ../data/img/box-plot
plt.savefig(f"../data/img/box-plot/{str(attribute)}.png")
# We obtain the a list of outliers for each attribute
list_of_outliers = train_df[attribute][train_df[attribute] > train_df[attribute].quantile(0.75) + 1.5*(train_df[attribute].quantile(0.75) - train_df[attribute].quantile(0.25))].tolist()
outliers = [f'{attribute} outliers'] + [len(list_of_outliers)] + [list_of_outliers]
# * In order to print the total number of outliers for each attribute
# print(f'{attribute} has {len(list_of_outliers)} outliers')
# ! Data structure: [attribute, number of outliers, list of outliers]
# print(outliers)
total_outliers.append(outliers)
# print the first 2 elements of each element in the list -> [[atb, num],[atb, num],...]
num_atb_outliers = 0
for i in total_outliers:
if i[1] != 0:
num_atb_outliers += 1
print(f"{i[0:2]}...")
# Number of outliers != 0 for each attribute
print(f"Total number of atributes with outliers: {num_atb_outliers} / {len(total_outliers)-1}")
['apcp_sf1_1 outliers', 693]... ['apcp_sf2_1 outliers', 674]... ['apcp_sf3_1 outliers', 677]... ['apcp_sf4_1 outliers', 761]... ['apcp_sf5_1 outliers', 709]... ['dswrf_s1_1 outliers', 299]... ['pres_ms1_1 outliers', 56]... ['pres_ms2_1 outliers', 55]... ['pres_ms3_1 outliers', 64]... ['pres_ms4_1 outliers', 68]... ['pres_ms5_1 outliers', 58]... ['tcdc_ea1_1 outliers', 514]... ['tcdc_ea2_1 outliers', 525]... ['tcdc_ea3_1 outliers', 575]... ['tcdc_ea4_1 outliers', 549]... ['tcdc_ea5_1 outliers', 559]... ['tcolc_e1_1 outliers', 513]... ['tcolc_e2_1 outliers', 523]... ['tcolc_e3_1 outliers', 575]... ['tcolc_e4_1 outliers', 555]... ['tcolc_e5_1 outliers', 560]... ['uswrf_s1_1 outliers', 337]... ['uswrf_s3_1 outliers', 3]... ['uswrf_s4_1 outliers', 31]... ['uswrf_s5_1 outliers', 9]... Total number of atributes with outliers: 25 / 75
We managed to create a list containing the name of the atribute, the number of outliers and the value of the outliers for each attribute, calculated by applying the IQR method.
This is relevant as we managed to create a 'total_outliers' variable that contains the list data structures of all the different outliers from all the attributes, so that it can be easily accessed in a future to remove the outliers from the dataset if needed for testing purposes.
As suspected, we can see that there are a lot of outliers in the dataset, therefore it is plausible that some of them are noise, thus possibly being removed in a future model in order to improve it (either by hand or by selection in the preprocess pipeline).
Now, we need to analyze if they are the result of bad measurements or if they are significant data for the analysis.
Additionaly, as we will see later, this amount of outliers indicate us that probably a Robust Scaler will be more appropriate than using a Standard Scaler, as the Robust Scaler is more robust to outliers than the Standard Scaler, thus being more appropriate for this dataset model.
Skewness and kurtosis are commonly used to measure the shape of a distribution. Skewness measures the degree of asymmetry in the distribution, while kurtosis measures the degree of flatness in the distribution compared to a normal distribution. We will look for observations that are far from the central tendency of the distribution and may indicate the presence of extreme values or data points that do not fit the pattern of the majority of the data (which as expected, happens to be the case in this dataset).
""" Skewness """
# ? skewness: measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
train_df.skew().sort_values(ascending=False)
apcp_sf4_1 9.297678
apcp_sf2_1 7.610005
apcp_sf5_1 7.244491
apcp_sf3_1 7.241727
apcp_sf1_1 6.783553
...
ulwrf_t1_1 -0.964701
ulwrf_t3_1 -0.989917
ulwrf_t2_1 -1.001763
ulwrf_t5_1 -1.071147
ulwrf_t4_1 -1.196425
Length: 76, dtype: float64
""" Kurtosis """
# ? kurtosis: measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
train_df.kurt().sort_values(ascending=False)
apcp_sf4_1 138.601323
apcp_sf2_1 79.535762
apcp_sf5_1 78.321580
apcp_sf3_1 72.498316
apcp_sf1_1 61.204708
...
uswrf_s2_1 -1.306893
spfh_2m2_1 -1.320499
spfh_2m5_1 -1.321073
dswrf_s2_1 -1.323864
spfh_2m3_1 -1.329847
Length: 76, dtype: float64
y = train_df["apcp_sf4_1"]
plt.figure(1)
plt.title("Normal")
sns.distplot(y, kde=True, fit=st.norm)
plt.figure(2)
plt.title("Log Normal")
sns.distplot(y, kde=True, fit=st.lognorm)
/tmp/ipykernel_7049/2978756091.py:4: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(y, kde=True, fit=st.norm) /tmp/ipykernel_7049/2978756091.py:7: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(y, kde=True, fit=st.lognorm)
<Axes: title={'center': 'Log Normal'}, xlabel='apcp_sf4_1', ylabel='Density'>
sns.distplot(train_df.skew(), color="blue", axlabel="Skewness")
/tmp/ipykernel_7049/388743980.py:1: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(train_df.skew(), color="blue", axlabel="Skewness")
<Axes: xlabel='Skewness', ylabel='Density'>
plt.figure(figsize=(12, 8))
sns.distplot(
train_df.kurt(), color="r", axlabel="Kurtosis", norm_hist=False, kde=True, rug=False
)
# plt.hist(train.kurt(),orientation = 'vertical',histtype = 'bar',label ='Kurtosis', color ='blue')
plt.show()
/tmp/ipykernel_7049/4054214216.py:2: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(
In this section we are getting information about the correlation of the variables between them. This information is valuable in order to make good decisions when deleting redundant attributes. Also note we are getting information about the correlation between each attribute and the solution variable. This allows us to know the most relevant attributes, making the best decisions when creating the different models.
correlation = train_df.corr()
correlation = abs(correlation)
print(correlation.shape) # 76 x 76 matrix of correlation values
(76, 76)
This is done for the sake of simplicity and to be able to visualize the correlation matrix in a more intuitive way.
correlation_list = []
for column in train_df.columns:
correlation.loc[:, column] = abs(
correlation.iloc[:, train_df.columns.get_loc(column)]
)
mask = correlation.loc[:, column] > 0.95
# print(correlation[column][mask].sort_values(ascending = False))
# Translate the comment below to English:
# we add the correlation values to a list of lists, which contains the names of the correlated columns and their correlation index
# The first segment adds the name of the column we are analyzing
# The second segment adds the names of the columns correlated (except the column we are analyzing) > 0.95
# The third segment adds the correlation index of the columns correlated (except the column we are analyzing) > 0.95
# Second and third segment are added to the first segment as a list of lists
# First we need to create a dictionary with the column names and their correlation values (except the column we are analyzing)
dict = {
key: value
for key, value in correlation.loc[column, mask]
.sort_values(ascending=False)
.iloc[1:]
.to_dict()
.items()
}
# print (dict)
# Then we create a list of lists with the column names and their correlation values from the dictionary created above
corr_list = [[key] + [value] for key, value in dict.items()]
# Finally we add the name of the column we are analyzing to the list of lists created above as the first element of the list (str)
corr_list.insert(0, ["Columna: " + column])
# ! Data structure: [[columna, [columna correlada 1, indice de correlacion], [columna correlada 2, indice de correlacion], ...], ...]
print(corr_list)
correlation_list += [corr_list]
print(correlation_list)
[['Columna: apcp_sf1_1']] [['Columna: apcp_sf2_1']] [['Columna: apcp_sf3_1']] [['Columna: apcp_sf4_1']] [['Columna: apcp_sf5_1']] [['Columna: dlwrf_s1_1'], ['dlwrf_s2_1', 0.9650067922254768], ['dlwrf_s3_1', 0.9547817730760655]] [['Columna: dlwrf_s2_1'], ['dlwrf_s3_1', 0.993701215706055], ['dlwrf_s1_1', 0.9650067922254768]] [['Columna: dlwrf_s3_1'], ['dlwrf_s2_1', 0.993701215706055], ['dlwrf_s4_1', 0.9659874690575408], ['dlwrf_s5_1', 0.9552712673845433], ['dlwrf_s1_1', 0.9547817730760655]] [['Columna: dlwrf_s4_1'], ['dlwrf_s5_1', 0.9969222914149775], ['dlwrf_s3_1', 0.9659874690575408]] [['Columna: dlwrf_s5_1'], ['dlwrf_s4_1', 0.9969222914149775], ['dlwrf_s3_1', 0.9552712673845433]] [['Columna: dswrf_s1_1']] [['Columna: dswrf_s2_1'], ['uswrf_s2_1', 0.9911709851006711], ['dswrf_s3_1', 0.9503896354343679]] [['Columna: dswrf_s3_1'], ['uswrf_s2_1', 0.9591814530708258], ['dswrf_s2_1', 0.9503896354343679]] [['Columna: dswrf_s4_1'], ['dswrf_s5_1', 0.982758557897581]] [['Columna: dswrf_s5_1'], ['dswrf_s4_1', 0.982758557897581]] [['Columna: pres_ms1_1'], ['pres_ms2_1', 0.9879236602955379], ['pres_ms3_1', 0.956852960202746]] [['Columna: pres_ms2_1'], ['pres_ms1_1', 0.9879236602955379], ['pres_ms3_1', 0.9869377705171734], ['pres_ms4_1', 0.9536176398645005]] [['Columna: pres_ms3_1'], ['pres_ms2_1', 0.9869377705171734], ['pres_ms4_1', 0.9866602703072012], ['pres_ms1_1', 0.956852960202746], ['pres_ms5_1', 0.9538147697170144]] [['Columna: pres_ms4_1'], ['pres_ms3_1', 0.9866602703072012], ['pres_ms5_1', 0.9851755074525863], ['pres_ms2_1', 0.9536176398645005]] [['Columna: pres_ms5_1'], ['pres_ms4_1', 0.9851755074525863], ['pres_ms3_1', 0.9538147697170144]] [['Columna: pwat_ea1_1'], ['pwat_ea2_1', 0.9859484994851248], ['pwat_ea3_1', 0.9577107162594556]] [['Columna: pwat_ea2_1'], ['pwat_ea3_1', 0.9874259658433963], ['pwat_ea1_1', 0.9859484994851248], ['pwat_ea4_1', 0.9618712300670131]] [['Columna: pwat_ea3_1'], ['pwat_ea4_1', 0.9880603787665849], ['pwat_ea2_1', 0.9874259658433963], ['pwat_ea5_1', 0.9616424908340101], ['pwat_ea1_1', 0.9577107162594556]] [['Columna: pwat_ea4_1'], ['pwat_ea3_1', 0.9880603787665849], ['pwat_ea5_1', 0.986763801908917], ['pwat_ea2_1', 0.9618712300670131]] [['Columna: pwat_ea5_1'], ['pwat_ea4_1', 0.986763801908917], ['pwat_ea3_1', 0.9616424908340101]] [['Columna: spfh_2m1_1'], ['spfh_2m2_1', 0.9742691195680059]] [['Columna: spfh_2m2_1'], ['spfh_2m3_1', 0.9846069576918387], ['spfh_2m1_1', 0.9742691195680059], ['spfh_2m4_1', 0.9600698332225309]] [['Columna: spfh_2m3_1'], ['spfh_2m4_1', 0.9891201306737782], ['spfh_2m2_1', 0.9846069576918387], ['spfh_2m5_1', 0.9771699520274281]] [['Columna: spfh_2m4_1'], ['spfh_2m5_1', 0.9904262248914517], ['spfh_2m3_1', 0.9891201306737782], ['spfh_2m2_1', 0.9600698332225309]] [['Columna: spfh_2m5_1'], ['spfh_2m4_1', 0.9904262248914517], ['spfh_2m3_1', 0.9771699520274281]] [['Columna: tcdc_ea1_1'], ['tcolc_e1_1', 0.9999826963362115]] [['Columna: tcdc_ea2_1'], ['tcolc_e2_1', 0.9999837132775715]] [['Columna: tcdc_ea3_1'], ['tcolc_e3_1', 0.9999845616560729]] [['Columna: tcdc_ea4_1'], ['tcolc_e4_1', 0.999984785893167]] [['Columna: tcdc_ea5_1'], ['tcolc_e5_1', 0.9999746391911669]] [['Columna: tcolc_e1_1'], ['tcdc_ea1_1', 0.9999826963362115]] [['Columna: tcolc_e2_1'], ['tcdc_ea2_1', 0.9999837132775715]] [['Columna: tcolc_e3_1'], ['tcdc_ea3_1', 0.9999845616560729]] [['Columna: tcolc_e4_1'], ['tcdc_ea4_1', 0.999984785893167]] [['Columna: tcolc_e5_1'], ['tcdc_ea5_1', 0.9999746391911669]] [['Columna: tmax_2m1_1'], ['ulwrf_s1_1', 0.9925465923917536], ['tmin_2m1_1', 0.9864627566914622], ['tmp_2m_1_1', 0.9844648786308661], ['tmp_sfc1_1', 0.9826171735169642], ['tmin_2m2_1', 0.9794577781348043], ['tmin_2m3_1', 0.97879484323375], ['ulwrf_s2_1', 0.9707221719806952], ['tmax_2m2_1', 0.9637997824764319], ['tmp_2m_2_1', 0.9602996922677912], ['ulwrf_s3_1', 0.9578768560091268], ['tmp_sfc2_1', 0.9528200105400535]] [['Columna: tmax_2m2_1'], ['tmp_2m_2_1', 0.9993112900738094], ['tmp_sfc2_1', 0.9963595768447152], ['ulwrf_s3_1', 0.9943559569788851], ['ulwrf_s2_1', 0.9917542607726626], ['tmax_2m3_1', 0.9863610206089146], ['tmp_2m_3_1', 0.9829472636912582], ['tmin_2m2_1', 0.9819461458841241], ['tmin_2m3_1', 0.9819334574863472], ['tmin_2m4_1', 0.9790285687297058], ['tmp_2m_1_1', 0.9771397511742945], ['tmin_2m1_1', 0.9735093237780129], ['tmin_2m5_1', 0.9713195080332866], ['tmax_2m4_1', 0.9698857690834339], ['tmax_2m5_1', 0.9697541323437304], ['tmp_sfc1_1', 0.9697293778574791], ['tmp_sfc5_1', 0.9692885564165558], ['tmp_2m_5_1', 0.9675902215544175], ['tmp_2m_4_1', 0.9645870257681579], ['tmax_2m1_1', 0.9637997824764319], ['ulwrf_s1_1', 0.9606784450210313], ['ulwrf_s5_1', 0.9582268665523904], ['tmp_sfc3_1', 0.9540154794130064]] [['Columna: tmax_2m3_1'], ['tmp_2m_3_1', 0.9989979986044517], ['tmin_2m4_1', 0.9973034556239585], ['tmax_2m4_1', 0.9937711137699543], ['tmax_2m5_1', 0.9931925682521422], ['tmp_2m_4_1', 0.9902875444682745], ['ulwrf_s3_1', 0.9882838798756105], ['tmp_2m_2_1', 0.9880399268817291], ['tmp_sfc2_1', 0.9874628996347063], ['tmax_2m2_1', 0.9863610206089146], ['tmin_2m5_1', 0.9849440628695366], ['tmp_sfc3_1', 0.9836268315982332], ['ulwrf_s5_1', 0.9831073628529786], ['tmp_2m_5_1', 0.9800734262132967], ['ulwrf_s4_1', 0.9784038012441758], ['tmp_sfc4_1', 0.977121003001589], ['tmp_sfc5_1', 0.9751782371847001], ['ulwrf_s2_1', 0.966905126635995], ['tmin_2m3_1', 0.9565532246066281], ['tmin_2m2_1', 0.9554925734198499]] [['Columna: tmax_2m4_1'], ['tmax_2m5_1', 0.999855391919824], ['tmp_2m_4_1', 0.9989670084606509], ['tmin_2m4_1', 0.9965267117336937], ['tmp_2m_3_1', 0.9954824569443925], ['tmax_2m3_1', 0.9937711137699543], ['ulwrf_s5_1', 0.9915793553887863], ['tmp_sfc4_1', 0.9903005164322048], ['tmin_2m5_1', 0.9888950622930037], ['ulwrf_s4_1', 0.987077087459209], ['tmp_2m_5_1', 0.9862698111992504], ['tmp_sfc3_1', 0.9857825429301692], ['tmp_sfc5_1', 0.9780961580272722], ['ulwrf_s3_1', 0.975826585247475], ['tmp_sfc2_1', 0.9734162968894661], ['tmp_2m_2_1', 0.9730783300205592], ['tmax_2m2_1', 0.9698857690834339]] [['Columna: tmax_2m5_1'], ['tmax_2m4_1', 0.999855391919824], ['tmp_2m_4_1', 0.9988753498975144], ['tmin_2m4_1', 0.9959215026917639], ['tmp_2m_3_1', 0.9948623637802105], ['tmax_2m3_1', 0.9931925682521422], ['ulwrf_s5_1', 0.99151293872598], ['tmp_sfc4_1', 0.9902303273948138], ['tmin_2m5_1', 0.9891405894645761], ['tmp_2m_5_1', 0.9872536923433799], ['ulwrf_s4_1', 0.9863521279974694], ['tmp_sfc3_1', 0.9848432369308402], ['tmp_sfc5_1', 0.9791492889290042], ['ulwrf_s3_1', 0.975536953838263], ['tmp_sfc2_1', 0.9732760733014082], ['tmp_2m_2_1', 0.9729567511523166], ['tmax_2m2_1', 0.9697541323437304]] [['Columna: tmin_2m1_1'], ['tmp_2m_1_1', 0.9981185076252812], ['tmp_sfc1_1', 0.9963142972792025], ['tmin_2m2_1', 0.9950949108277914], ['tmin_2m3_1', 0.9943408586923532], ['ulwrf_s1_1', 0.9926165091998508], ['tmax_2m1_1', 0.9864627566914622], ['ulwrf_s2_1', 0.9814254257054185], ['tmax_2m2_1', 0.9735093237780129], ['tmp_2m_2_1', 0.9708467193246784], ['ulwrf_s3_1', 0.9654648681213764], ['tmp_sfc2_1', 0.9605896048258724]] [['Columna: tmin_2m2_1'], ['tmin_2m3_1', 0.9996572623026859], ['tmp_2m_1_1', 0.9981819106379802], ['tmp_sfc1_1', 0.9960749292152973], ['tmin_2m1_1', 0.9950949108277914], ['ulwrf_s2_1', 0.9899721168841563], ['ulwrf_s1_1', 0.9850580289147287], ['tmax_2m2_1', 0.9819461458841241], ['tmp_2m_2_1', 0.9808603480091717], ['tmax_2m1_1', 0.9794577781348043], ['ulwrf_s3_1', 0.9743529570168544], ['tmp_sfc2_1', 0.9709866438853475], ['tmax_2m3_1', 0.9554925734198499], ['tmp_2m_3_1', 0.9516543862813964]] [['Columna: tmin_2m3_1'], ['tmin_2m2_1', 0.9996572623026859], ['tmp_2m_1_1', 0.9973859703817012], ['tmp_sfc1_1', 0.9951903927420194], ['tmin_2m1_1', 0.9943408586923532], ['ulwrf_s2_1', 0.9897318251363909], ['ulwrf_s1_1', 0.9843176374301973], ['tmax_2m2_1', 0.9819334574863472], ['tmp_2m_2_1', 0.9813049765366397], ['tmax_2m1_1', 0.97879484323375], ['ulwrf_s3_1', 0.9751256747221002], ['tmp_sfc2_1', 0.9715874768684567], ['tmax_2m3_1', 0.9565532246066281], ['tmp_2m_3_1', 0.9537926818714694]] [['Columna: tmin_2m4_1'], ['tmp_2m_3_1', 0.9989480053798309], ['tmax_2m3_1', 0.9973034556239585], ['tmax_2m4_1', 0.9965267117336937], ['tmax_2m5_1', 0.9959215026917639], ['tmp_2m_4_1', 0.9955546180266533], ['tmin_2m5_1', 0.9884708317878447], ['ulwrf_s5_1', 0.9881522932211056], ['tmp_sfc3_1', 0.985897294709102], ['tmp_sfc4_1', 0.9839686317112281], ['ulwrf_s4_1', 0.9837254094808376], ['tmp_2m_5_1', 0.9836800034039297], ['ulwrf_s3_1', 0.9834305304179148], ['tmp_2m_2_1', 0.9819192741530762], ['tmp_sfc2_1', 0.9818609853360799], ['tmax_2m2_1', 0.9790285687297058], ['tmp_sfc5_1', 0.9773521584222015], ['ulwrf_s2_1', 0.9587402415390521]] [['Columna: tmin_2m5_1'], ['tmp_2m_5_1', 0.9985248666465218], ['tmp_sfc5_1', 0.9960054258735697], ['tmp_2m_4_1', 0.9895327574093381], ['tmax_2m5_1', 0.9891405894645761], ['tmax_2m4_1', 0.9888950622930037], ['tmin_2m4_1', 0.9884708317878447], ['ulwrf_s5_1', 0.9873222874147335], ['tmp_2m_3_1', 0.9863630955598195], ['tmax_2m3_1', 0.9849440628695366], ['tmp_sfc4_1', 0.9799245981499726], ['ulwrf_s3_1', 0.9771636437810954], ['tmp_sfc2_1', 0.975307217691161], ['tmp_2m_2_1', 0.973997775172227], ['ulwrf_s4_1', 0.9737488476761204], ['tmax_2m2_1', 0.9713195080332866], ['tmp_sfc3_1', 0.9704219757785946], ['ulwrf_s2_1', 0.9584808725553905]] [['Columna: tmp_2m_1_1'], ['tmin_2m2_1', 0.9981819106379802], ['tmin_2m1_1', 0.9981185076252812], ['tmp_sfc1_1', 0.9979448017549044], ['tmin_2m3_1', 0.9973859703817012], ['ulwrf_s1_1', 0.9897625110599417], ['ulwrf_s2_1', 0.9853437253651708], ['tmax_2m1_1', 0.9844648786308661], ['tmax_2m2_1', 0.9771397511742945], ['tmp_2m_2_1', 0.9744909003873826], ['ulwrf_s3_1', 0.968308266178652], ['tmp_sfc2_1', 0.9638593940360212]] [['Columna: tmp_2m_2_1'], ['tmax_2m2_1', 0.9993112900738094], ['tmp_sfc2_1', 0.9971658992806404], ['ulwrf_s3_1', 0.994870585784455], ['ulwrf_s2_1', 0.990609269111517], ['tmax_2m3_1', 0.9880399268817291], ['tmp_2m_3_1', 0.9856921834337115], ['tmin_2m4_1', 0.9819192741530762], ['tmin_2m3_1', 0.9813049765366397], ['tmin_2m2_1', 0.9808603480091717], ['tmp_2m_1_1', 0.9744909003873826], ['tmin_2m5_1', 0.973997775172227], ['tmax_2m4_1', 0.9730783300205592], ['tmax_2m5_1', 0.9729567511523166], ['tmp_sfc5_1', 0.9714528776518785], ['tmin_2m1_1', 0.9708467193246784], ['tmp_2m_5_1', 0.9702697100229983], ['tmp_2m_4_1', 0.9679046544965666], ['tmp_sfc1_1', 0.9668200402557023], ['ulwrf_s5_1', 0.960980140611558], ['tmax_2m1_1', 0.9602996922677912], ['ulwrf_s1_1', 0.9573813147342563], ['tmp_sfc3_1', 0.9570765900532514]] [['Columna: tmp_2m_3_1'], ['tmax_2m3_1', 0.9989979986044517], ['tmin_2m4_1', 0.9989480053798309], ['tmax_2m4_1', 0.9954824569443925], ['tmax_2m5_1', 0.9948623637802105], ['tmp_2m_4_1', 0.9925508354894933], ['ulwrf_s3_1', 0.986475159576972], ['tmin_2m5_1', 0.9863630955598195], ['tmp_2m_2_1', 0.9856921834337115], ['tmp_sfc3_1', 0.9856020769129077], ['tmp_sfc2_1', 0.9853721430018374], ['ulwrf_s5_1', 0.9850287211532148], ['tmax_2m2_1', 0.9829472636912582], ['tmp_2m_5_1', 0.9813875596975576], ['ulwrf_s4_1', 0.9807390701133252], ['tmp_sfc4_1', 0.9796642142509698], ['tmp_sfc5_1', 0.9758811197330793], ['ulwrf_s2_1', 0.9631310421561378], ['tmin_2m3_1', 0.9537926818714694], ['tmin_2m2_1', 0.9516543862813964]] [['Columna: tmp_2m_4_1'], ['tmax_2m4_1', 0.9989670084606509], ['tmax_2m5_1', 0.9988753498975144], ['tmin_2m4_1', 0.9955546180266533], ['ulwrf_s5_1', 0.9926662202081515], ['tmp_2m_3_1', 0.9925508354894933], ['tmp_sfc4_1', 0.9924835965861123], ['tmax_2m3_1', 0.9902875444682745], ['tmin_2m5_1', 0.9895327574093381], ['ulwrf_s4_1', 0.9875947598774877], ['tmp_2m_5_1', 0.9871404965220815], ['tmp_sfc3_1', 0.9838690005793208], ['tmp_sfc5_1', 0.9781934487046426], ['ulwrf_s3_1', 0.9712226930987012], ['tmp_sfc2_1', 0.9684774894714101], ['tmp_2m_2_1', 0.9679046544965666], ['tmax_2m2_1', 0.9645870257681579]] [['Columna: tmp_2m_5_1'], ['tmin_2m5_1', 0.9985248666465218], ['tmp_sfc5_1', 0.9980075740584928], ['tmax_2m5_1', 0.9872536923433799], ['tmp_2m_4_1', 0.9871404965220815], ['tmax_2m4_1', 0.9862698111992504], ['ulwrf_s5_1', 0.9852702329370912], ['tmin_2m4_1', 0.9836800034039297], ['tmp_2m_3_1', 0.9813875596975576], ['tmax_2m3_1', 0.9800734262132967], ['tmp_sfc4_1', 0.9781233238955636], ['ulwrf_s3_1', 0.9731828651939985], ['tmp_sfc2_1', 0.971798288488127], ['tmp_2m_2_1', 0.9702697100229983], ['ulwrf_s4_1', 0.9695091232286572], ['tmax_2m2_1', 0.9675902215544175], ['tmp_sfc3_1', 0.9646544198978636], ['ulwrf_s2_1', 0.9560882131646383]] [['Columna: tmp_sfc1_1'], ['tmp_2m_1_1', 0.9979448017549044], ['tmin_2m1_1', 0.9963142972792025], ['tmin_2m2_1', 0.9960749292152973], ['tmin_2m3_1', 0.9951903927420194], ['ulwrf_s1_1', 0.9919281097281679], ['ulwrf_s2_1', 0.9842676480547997], ['tmax_2m1_1', 0.9826171735169642], ['tmax_2m2_1', 0.9697293778574791], ['tmp_2m_2_1', 0.9668200402557023], ['ulwrf_s3_1', 0.9621315029521358], ['tmp_sfc2_1', 0.957188784436685]] [['Columna: tmp_sfc2_1'], ['tmp_2m_2_1', 0.9971658992806404], ['ulwrf_s3_1', 0.996855293888584], ['tmax_2m2_1', 0.9963595768447152], ['ulwrf_s2_1', 0.9879322271553737], ['tmax_2m3_1', 0.9874628996347063], ['tmp_2m_3_1', 0.9853721430018374], ['tmin_2m4_1', 0.9818609853360799], ['tmin_2m5_1', 0.975307217691161], ['tmp_sfc5_1', 0.9740593607013978], ['tmax_2m4_1', 0.9734162968894661], ['tmax_2m5_1', 0.9732760733014082], ['tmp_2m_5_1', 0.971798288488127], ['tmin_2m3_1', 0.9715874768684567], ['tmin_2m2_1', 0.9709866438853475], ['tmp_2m_4_1', 0.9684774894714101], ['ulwrf_s5_1', 0.9670339151339817], ['tmp_sfc3_1', 0.9649823936827109], ['tmp_2m_1_1', 0.9638593940360212], ['tmin_2m1_1', 0.9605896048258724], ['ulwrf_s4_1', 0.9572236269499521], ['tmp_sfc1_1', 0.957188784436685], ['tmp_sfc4_1', 0.9555363809053586], ['tmax_2m1_1', 0.9528200105400535]] [['Columna: tmp_sfc3_1'], ['ulwrf_s4_1', 0.9947627885687921], ['ulwrf_s5_1', 0.9892581590922543], ['tmp_sfc4_1', 0.9884745475700606], ['tmin_2m4_1', 0.985897294709102], ['tmax_2m4_1', 0.9857825429301692], ['tmp_2m_3_1', 0.9856020769129077], ['tmax_2m5_1', 0.9848432369308402], ['tmp_2m_4_1', 0.9838690005793208], ['tmax_2m3_1', 0.9836268315982332], ['tmin_2m5_1', 0.9704219757785946], ['ulwrf_s3_1', 0.9698096931439735], ['tmp_sfc2_1', 0.9649823936827109], ['tmp_2m_5_1', 0.9646544198978636], ['tmp_2m_2_1', 0.9570765900532514], ['tmp_sfc5_1', 0.9562755066656142], ['tmax_2m2_1', 0.9540154794130064]] [['Columna: tmp_sfc4_1'], ['ulwrf_s5_1', 0.996612200039398], ['ulwrf_s4_1', 0.9957411309514121], ['tmp_2m_4_1', 0.9924835965861123], ['tmax_2m4_1', 0.9903005164322048], ['tmax_2m5_1', 0.9902303273948138], ['tmp_sfc3_1', 0.9884745475700606], ['tmin_2m4_1', 0.9839686317112281], ['tmin_2m5_1', 0.9799245981499726], ['tmp_2m_3_1', 0.9796642142509698], ['tmp_2m_5_1', 0.9781233238955636], ['tmax_2m3_1', 0.977121003001589], ['tmp_sfc5_1', 0.9687614236330414], ['ulwrf_s3_1', 0.9604279752910915], ['tmp_sfc2_1', 0.9555363809053586]] [['Columna: tmp_sfc5_1'], ['tmp_2m_5_1', 0.9980075740584928], ['tmin_2m5_1', 0.9960054258735697], ['tmax_2m5_1', 0.9791492889290042], ['ulwrf_s5_1', 0.9789322162596626], ['tmp_2m_4_1', 0.9781934487046426], ['tmax_2m4_1', 0.9780961580272722], ['tmin_2m4_1', 0.9773521584222015], ['tmp_2m_3_1', 0.9758811197330793], ['tmax_2m3_1', 0.9751782371847001], ['ulwrf_s3_1', 0.9744478191517459], ['tmp_sfc2_1', 0.9740593607013978], ['tmp_2m_2_1', 0.9714528776518785], ['tmax_2m2_1', 0.9692885564165558], ['tmp_sfc4_1', 0.9687614236330414], ['ulwrf_s2_1', 0.9619766387479004], ['ulwrf_s4_1', 0.960457161005563], ['tmp_sfc3_1', 0.9562755066656142]] [['Columna: ulwrf_s1_1'], ['tmin_2m1_1', 0.9926165091998508], ['tmax_2m1_1', 0.9925465923917536], ['tmp_sfc1_1', 0.9919281097281679], ['tmp_2m_1_1', 0.9897625110599417], ['tmin_2m2_1', 0.9850580289147287], ['tmin_2m3_1', 0.9843176374301973], ['ulwrf_s2_1', 0.9762674995860944], ['tmax_2m2_1', 0.9606784450210313], ['ulwrf_s3_1', 0.9574576992187499], ['tmp_2m_2_1', 0.9573813147342563]] [['Columna: ulwrf_s2_1'], ['tmax_2m2_1', 0.9917542607726626], ['tmp_2m_2_1', 0.990609269111517], ['ulwrf_s3_1', 0.990564307766117], ['tmin_2m2_1', 0.9899721168841563], ['tmin_2m3_1', 0.9897318251363909], ['tmp_sfc2_1', 0.9879322271553737], ['tmp_2m_1_1', 0.9853437253651708], ['tmp_sfc1_1', 0.9842676480547997], ['tmin_2m1_1', 0.9814254257054185], ['ulwrf_s1_1', 0.9762674995860944], ['tmax_2m1_1', 0.9707221719806952], ['tmax_2m3_1', 0.966905126635995], ['tmp_2m_3_1', 0.9631310421561378], ['tmp_sfc5_1', 0.9619766387479004], ['tmin_2m4_1', 0.9587402415390521], ['tmin_2m5_1', 0.9584808725553905], ['tmp_2m_5_1', 0.9560882131646383]] [['Columna: ulwrf_s3_1'], ['tmp_sfc2_1', 0.996855293888584], ['tmp_2m_2_1', 0.994870585784455], ['tmax_2m2_1', 0.9943559569788851], ['ulwrf_s2_1', 0.990564307766117], ['tmax_2m3_1', 0.9882838798756105], ['tmp_2m_3_1', 0.986475159576972], ['tmin_2m4_1', 0.9834305304179148], ['tmin_2m5_1', 0.9771636437810954], ['tmax_2m4_1', 0.975826585247475], ['tmax_2m5_1', 0.975536953838263], ['tmin_2m3_1', 0.9751256747221002], ['tmp_sfc5_1', 0.9744478191517459], ['tmin_2m2_1', 0.9743529570168544], ['ulwrf_s5_1', 0.9734148064192735], ['tmp_2m_5_1', 0.9731828651939985], ['tmp_2m_4_1', 0.9712226930987012], ['tmp_sfc3_1', 0.9698096931439735], ['tmp_2m_1_1', 0.968308266178652], ['tmin_2m1_1', 0.9654648681213764], ['ulwrf_s4_1', 0.9651706956885256], ['tmp_sfc1_1', 0.9621315029521358], ['tmp_sfc4_1', 0.9604279752910915], ['tmax_2m1_1', 0.9578768560091268], ['ulwrf_s1_1', 0.9574576992187499]] [['Columna: ulwrf_s4_1'], ['ulwrf_s5_1', 0.9963430558611763], ['tmp_sfc4_1', 0.9957411309514121], ['tmp_sfc3_1', 0.9947627885687921], ['tmp_2m_4_1', 0.9875947598774877], ['tmax_2m4_1', 0.987077087459209], ['tmax_2m5_1', 0.9863521279974694], ['tmin_2m4_1', 0.9837254094808376], ['tmp_2m_3_1', 0.9807390701133252], ['tmax_2m3_1', 0.9784038012441758], ['tmin_2m5_1', 0.9737488476761204], ['tmp_2m_5_1', 0.9695091232286572], ['ulwrf_s3_1', 0.9651706956885256], ['tmp_sfc5_1', 0.960457161005563], ['tmp_sfc2_1', 0.9572236269499521]] [['Columna: ulwrf_s5_1'], ['tmp_sfc4_1', 0.996612200039398], ['ulwrf_s4_1', 0.9963430558611763], ['tmp_2m_4_1', 0.9926662202081515], ['tmax_2m4_1', 0.9915793553887863], ['tmax_2m5_1', 0.99151293872598], ['tmp_sfc3_1', 0.9892581590922543], ['tmin_2m4_1', 0.9881522932211056], ['tmin_2m5_1', 0.9873222874147335], ['tmp_2m_5_1', 0.9852702329370912], ['tmp_2m_3_1', 0.9850287211532148], ['tmax_2m3_1', 0.9831073628529786], ['tmp_sfc5_1', 0.9789322162596626], ['ulwrf_s3_1', 0.9734148064192735], ['tmp_sfc2_1', 0.9670339151339817], ['tmp_2m_2_1', 0.960980140611558], ['tmax_2m2_1', 0.9582268665523904]] [['Columna: ulwrf_t1_1']] [['Columna: ulwrf_t2_1'], ['ulwrf_t3_1', 0.9744666921198298]] [['Columna: ulwrf_t3_1'], ['ulwrf_t2_1', 0.9744666921198298]] [['Columna: ulwrf_t4_1'], ['ulwrf_t5_1', 0.9755542908956468]] [['Columna: ulwrf_t5_1'], ['ulwrf_t4_1', 0.9755542908956468]] [['Columna: uswrf_s1_1']] [['Columna: uswrf_s2_1'], ['dswrf_s2_1', 0.9911709851006711], ['dswrf_s3_1', 0.9591814530708258]] [['Columna: uswrf_s3_1']] [['Columna: uswrf_s4_1'], ['uswrf_s5_1', 0.9562280634672189]] [['Columna: uswrf_s5_1'], ['uswrf_s4_1', 0.9562280634672189]] [['Columna: salida']] [[['Columna: apcp_sf1_1']], [['Columna: apcp_sf2_1']], [['Columna: apcp_sf3_1']], [['Columna: apcp_sf4_1']], [['Columna: apcp_sf5_1']], [['Columna: dlwrf_s1_1'], ['dlwrf_s2_1', 0.9650067922254768], ['dlwrf_s3_1', 0.9547817730760655]], [['Columna: dlwrf_s2_1'], ['dlwrf_s3_1', 0.993701215706055], ['dlwrf_s1_1', 0.9650067922254768]], [['Columna: dlwrf_s3_1'], ['dlwrf_s2_1', 0.993701215706055], ['dlwrf_s4_1', 0.9659874690575408], ['dlwrf_s5_1', 0.9552712673845433], ['dlwrf_s1_1', 0.9547817730760655]], [['Columna: dlwrf_s4_1'], ['dlwrf_s5_1', 0.9969222914149775], ['dlwrf_s3_1', 0.9659874690575408]], [['Columna: dlwrf_s5_1'], ['dlwrf_s4_1', 0.9969222914149775], ['dlwrf_s3_1', 0.9552712673845433]], [['Columna: dswrf_s1_1']], [['Columna: dswrf_s2_1'], ['uswrf_s2_1', 0.9911709851006711], ['dswrf_s3_1', 0.9503896354343679]], [['Columna: dswrf_s3_1'], ['uswrf_s2_1', 0.9591814530708258], ['dswrf_s2_1', 0.9503896354343679]], [['Columna: dswrf_s4_1'], ['dswrf_s5_1', 0.982758557897581]], [['Columna: dswrf_s5_1'], ['dswrf_s4_1', 0.982758557897581]], [['Columna: pres_ms1_1'], ['pres_ms2_1', 0.9879236602955379], ['pres_ms3_1', 0.956852960202746]], [['Columna: pres_ms2_1'], ['pres_ms1_1', 0.9879236602955379], ['pres_ms3_1', 0.9869377705171734], ['pres_ms4_1', 0.9536176398645005]], [['Columna: pres_ms3_1'], ['pres_ms2_1', 0.9869377705171734], ['pres_ms4_1', 0.9866602703072012], ['pres_ms1_1', 0.956852960202746], ['pres_ms5_1', 0.9538147697170144]], [['Columna: pres_ms4_1'], ['pres_ms3_1', 0.9866602703072012], ['pres_ms5_1', 0.9851755074525863], ['pres_ms2_1', 0.9536176398645005]], [['Columna: pres_ms5_1'], ['pres_ms4_1', 0.9851755074525863], ['pres_ms3_1', 0.9538147697170144]], [['Columna: pwat_ea1_1'], ['pwat_ea2_1', 0.9859484994851248], ['pwat_ea3_1', 0.9577107162594556]], [['Columna: pwat_ea2_1'], ['pwat_ea3_1', 0.9874259658433963], ['pwat_ea1_1', 0.9859484994851248], ['pwat_ea4_1', 0.9618712300670131]], [['Columna: pwat_ea3_1'], ['pwat_ea4_1', 0.9880603787665849], ['pwat_ea2_1', 0.9874259658433963], ['pwat_ea5_1', 0.9616424908340101], ['pwat_ea1_1', 0.9577107162594556]], [['Columna: pwat_ea4_1'], ['pwat_ea3_1', 0.9880603787665849], ['pwat_ea5_1', 0.986763801908917], ['pwat_ea2_1', 0.9618712300670131]], [['Columna: pwat_ea5_1'], ['pwat_ea4_1', 0.986763801908917], ['pwat_ea3_1', 0.9616424908340101]], [['Columna: spfh_2m1_1'], ['spfh_2m2_1', 0.9742691195680059]], [['Columna: spfh_2m2_1'], ['spfh_2m3_1', 0.9846069576918387], ['spfh_2m1_1', 0.9742691195680059], ['spfh_2m4_1', 0.9600698332225309]], [['Columna: spfh_2m3_1'], ['spfh_2m4_1', 0.9891201306737782], ['spfh_2m2_1', 0.9846069576918387], ['spfh_2m5_1', 0.9771699520274281]], [['Columna: spfh_2m4_1'], ['spfh_2m5_1', 0.9904262248914517], ['spfh_2m3_1', 0.9891201306737782], ['spfh_2m2_1', 0.9600698332225309]], [['Columna: spfh_2m5_1'], ['spfh_2m4_1', 0.9904262248914517], ['spfh_2m3_1', 0.9771699520274281]], [['Columna: tcdc_ea1_1'], ['tcolc_e1_1', 0.9999826963362115]], [['Columna: tcdc_ea2_1'], ['tcolc_e2_1', 0.9999837132775715]], [['Columna: tcdc_ea3_1'], ['tcolc_e3_1', 0.9999845616560729]], [['Columna: tcdc_ea4_1'], ['tcolc_e4_1', 0.999984785893167]], [['Columna: tcdc_ea5_1'], ['tcolc_e5_1', 0.9999746391911669]], [['Columna: tcolc_e1_1'], ['tcdc_ea1_1', 0.9999826963362115]], [['Columna: tcolc_e2_1'], ['tcdc_ea2_1', 0.9999837132775715]], [['Columna: tcolc_e3_1'], ['tcdc_ea3_1', 0.9999845616560729]], [['Columna: tcolc_e4_1'], ['tcdc_ea4_1', 0.999984785893167]], [['Columna: tcolc_e5_1'], ['tcdc_ea5_1', 0.9999746391911669]], [['Columna: tmax_2m1_1'], ['ulwrf_s1_1', 0.9925465923917536], ['tmin_2m1_1', 0.9864627566914622], ['tmp_2m_1_1', 0.9844648786308661], ['tmp_sfc1_1', 0.9826171735169642], ['tmin_2m2_1', 0.9794577781348043], ['tmin_2m3_1', 0.97879484323375], ['ulwrf_s2_1', 0.9707221719806952], ['tmax_2m2_1', 0.9637997824764319], ['tmp_2m_2_1', 0.9602996922677912], ['ulwrf_s3_1', 0.9578768560091268], ['tmp_sfc2_1', 0.9528200105400535]], [['Columna: tmax_2m2_1'], ['tmp_2m_2_1', 0.9993112900738094], ['tmp_sfc2_1', 0.9963595768447152], ['ulwrf_s3_1', 0.9943559569788851], ['ulwrf_s2_1', 0.9917542607726626], ['tmax_2m3_1', 0.9863610206089146], ['tmp_2m_3_1', 0.9829472636912582], ['tmin_2m2_1', 0.9819461458841241], ['tmin_2m3_1', 0.9819334574863472], ['tmin_2m4_1', 0.9790285687297058], ['tmp_2m_1_1', 0.9771397511742945], ['tmin_2m1_1', 0.9735093237780129], ['tmin_2m5_1', 0.9713195080332866], ['tmax_2m4_1', 0.9698857690834339], ['tmax_2m5_1', 0.9697541323437304], ['tmp_sfc1_1', 0.9697293778574791], ['tmp_sfc5_1', 0.9692885564165558], ['tmp_2m_5_1', 0.9675902215544175], ['tmp_2m_4_1', 0.9645870257681579], ['tmax_2m1_1', 0.9637997824764319], ['ulwrf_s1_1', 0.9606784450210313], ['ulwrf_s5_1', 0.9582268665523904], ['tmp_sfc3_1', 0.9540154794130064]], [['Columna: tmax_2m3_1'], ['tmp_2m_3_1', 0.9989979986044517], ['tmin_2m4_1', 0.9973034556239585], ['tmax_2m4_1', 0.9937711137699543], ['tmax_2m5_1', 0.9931925682521422], ['tmp_2m_4_1', 0.9902875444682745], ['ulwrf_s3_1', 0.9882838798756105], ['tmp_2m_2_1', 0.9880399268817291], ['tmp_sfc2_1', 0.9874628996347063], ['tmax_2m2_1', 0.9863610206089146], ['tmin_2m5_1', 0.9849440628695366], ['tmp_sfc3_1', 0.9836268315982332], ['ulwrf_s5_1', 0.9831073628529786], ['tmp_2m_5_1', 0.9800734262132967], ['ulwrf_s4_1', 0.9784038012441758], ['tmp_sfc4_1', 0.977121003001589], ['tmp_sfc5_1', 0.9751782371847001], ['ulwrf_s2_1', 0.966905126635995], ['tmin_2m3_1', 0.9565532246066281], ['tmin_2m2_1', 0.9554925734198499]], [['Columna: tmax_2m4_1'], ['tmax_2m5_1', 0.999855391919824], ['tmp_2m_4_1', 0.9989670084606509], ['tmin_2m4_1', 0.9965267117336937], ['tmp_2m_3_1', 0.9954824569443925], ['tmax_2m3_1', 0.9937711137699543], ['ulwrf_s5_1', 0.9915793553887863], ['tmp_sfc4_1', 0.9903005164322048], ['tmin_2m5_1', 0.9888950622930037], ['ulwrf_s4_1', 0.987077087459209], ['tmp_2m_5_1', 0.9862698111992504], ['tmp_sfc3_1', 0.9857825429301692], ['tmp_sfc5_1', 0.9780961580272722], ['ulwrf_s3_1', 0.975826585247475], ['tmp_sfc2_1', 0.9734162968894661], ['tmp_2m_2_1', 0.9730783300205592], ['tmax_2m2_1', 0.9698857690834339]], [['Columna: tmax_2m5_1'], ['tmax_2m4_1', 0.999855391919824], ['tmp_2m_4_1', 0.9988753498975144], ['tmin_2m4_1', 0.9959215026917639], ['tmp_2m_3_1', 0.9948623637802105], ['tmax_2m3_1', 0.9931925682521422], ['ulwrf_s5_1', 0.99151293872598], ['tmp_sfc4_1', 0.9902303273948138], ['tmin_2m5_1', 0.9891405894645761], ['tmp_2m_5_1', 0.9872536923433799], ['ulwrf_s4_1', 0.9863521279974694], ['tmp_sfc3_1', 0.9848432369308402], ['tmp_sfc5_1', 0.9791492889290042], ['ulwrf_s3_1', 0.975536953838263], ['tmp_sfc2_1', 0.9732760733014082], ['tmp_2m_2_1', 0.9729567511523166], ['tmax_2m2_1', 0.9697541323437304]], [['Columna: tmin_2m1_1'], ['tmp_2m_1_1', 0.9981185076252812], ['tmp_sfc1_1', 0.9963142972792025], ['tmin_2m2_1', 0.9950949108277914], ['tmin_2m3_1', 0.9943408586923532], ['ulwrf_s1_1', 0.9926165091998508], ['tmax_2m1_1', 0.9864627566914622], ['ulwrf_s2_1', 0.9814254257054185], ['tmax_2m2_1', 0.9735093237780129], ['tmp_2m_2_1', 0.9708467193246784], ['ulwrf_s3_1', 0.9654648681213764], ['tmp_sfc2_1', 0.9605896048258724]], [['Columna: tmin_2m2_1'], ['tmin_2m3_1', 0.9996572623026859], ['tmp_2m_1_1', 0.9981819106379802], ['tmp_sfc1_1', 0.9960749292152973], ['tmin_2m1_1', 0.9950949108277914], ['ulwrf_s2_1', 0.9899721168841563], ['ulwrf_s1_1', 0.9850580289147287], ['tmax_2m2_1', 0.9819461458841241], ['tmp_2m_2_1', 0.9808603480091717], ['tmax_2m1_1', 0.9794577781348043], ['ulwrf_s3_1', 0.9743529570168544], ['tmp_sfc2_1', 0.9709866438853475], ['tmax_2m3_1', 0.9554925734198499], ['tmp_2m_3_1', 0.9516543862813964]], [['Columna: tmin_2m3_1'], ['tmin_2m2_1', 0.9996572623026859], ['tmp_2m_1_1', 0.9973859703817012], ['tmp_sfc1_1', 0.9951903927420194], ['tmin_2m1_1', 0.9943408586923532], ['ulwrf_s2_1', 0.9897318251363909], ['ulwrf_s1_1', 0.9843176374301973], ['tmax_2m2_1', 0.9819334574863472], ['tmp_2m_2_1', 0.9813049765366397], ['tmax_2m1_1', 0.97879484323375], ['ulwrf_s3_1', 0.9751256747221002], ['tmp_sfc2_1', 0.9715874768684567], ['tmax_2m3_1', 0.9565532246066281], ['tmp_2m_3_1', 0.9537926818714694]], [['Columna: tmin_2m4_1'], ['tmp_2m_3_1', 0.9989480053798309], ['tmax_2m3_1', 0.9973034556239585], ['tmax_2m4_1', 0.9965267117336937], ['tmax_2m5_1', 0.9959215026917639], ['tmp_2m_4_1', 0.9955546180266533], ['tmin_2m5_1', 0.9884708317878447], ['ulwrf_s5_1', 0.9881522932211056], ['tmp_sfc3_1', 0.985897294709102], ['tmp_sfc4_1', 0.9839686317112281], ['ulwrf_s4_1', 0.9837254094808376], ['tmp_2m_5_1', 0.9836800034039297], ['ulwrf_s3_1', 0.9834305304179148], ['tmp_2m_2_1', 0.9819192741530762], ['tmp_sfc2_1', 0.9818609853360799], ['tmax_2m2_1', 0.9790285687297058], ['tmp_sfc5_1', 0.9773521584222015], ['ulwrf_s2_1', 0.9587402415390521]], [['Columna: tmin_2m5_1'], ['tmp_2m_5_1', 0.9985248666465218], ['tmp_sfc5_1', 0.9960054258735697], ['tmp_2m_4_1', 0.9895327574093381], ['tmax_2m5_1', 0.9891405894645761], ['tmax_2m4_1', 0.9888950622930037], ['tmin_2m4_1', 0.9884708317878447], ['ulwrf_s5_1', 0.9873222874147335], ['tmp_2m_3_1', 0.9863630955598195], ['tmax_2m3_1', 0.9849440628695366], ['tmp_sfc4_1', 0.9799245981499726], ['ulwrf_s3_1', 0.9771636437810954], ['tmp_sfc2_1', 0.975307217691161], ['tmp_2m_2_1', 0.973997775172227], ['ulwrf_s4_1', 0.9737488476761204], ['tmax_2m2_1', 0.9713195080332866], ['tmp_sfc3_1', 0.9704219757785946], ['ulwrf_s2_1', 0.9584808725553905]], [['Columna: tmp_2m_1_1'], ['tmin_2m2_1', 0.9981819106379802], ['tmin_2m1_1', 0.9981185076252812], ['tmp_sfc1_1', 0.9979448017549044], ['tmin_2m3_1', 0.9973859703817012], ['ulwrf_s1_1', 0.9897625110599417], ['ulwrf_s2_1', 0.9853437253651708], ['tmax_2m1_1', 0.9844648786308661], ['tmax_2m2_1', 0.9771397511742945], ['tmp_2m_2_1', 0.9744909003873826], ['ulwrf_s3_1', 0.968308266178652], ['tmp_sfc2_1', 0.9638593940360212]], [['Columna: tmp_2m_2_1'], ['tmax_2m2_1', 0.9993112900738094], ['tmp_sfc2_1', 0.9971658992806404], ['ulwrf_s3_1', 0.994870585784455], ['ulwrf_s2_1', 0.990609269111517], ['tmax_2m3_1', 0.9880399268817291], ['tmp_2m_3_1', 0.9856921834337115], ['tmin_2m4_1', 0.9819192741530762], ['tmin_2m3_1', 0.9813049765366397], ['tmin_2m2_1', 0.9808603480091717], ['tmp_2m_1_1', 0.9744909003873826], ['tmin_2m5_1', 0.973997775172227], ['tmax_2m4_1', 0.9730783300205592], ['tmax_2m5_1', 0.9729567511523166], ['tmp_sfc5_1', 0.9714528776518785], ['tmin_2m1_1', 0.9708467193246784], ['tmp_2m_5_1', 0.9702697100229983], ['tmp_2m_4_1', 0.9679046544965666], ['tmp_sfc1_1', 0.9668200402557023], ['ulwrf_s5_1', 0.960980140611558], ['tmax_2m1_1', 0.9602996922677912], ['ulwrf_s1_1', 0.9573813147342563], ['tmp_sfc3_1', 0.9570765900532514]], [['Columna: tmp_2m_3_1'], ['tmax_2m3_1', 0.9989979986044517], ['tmin_2m4_1', 0.9989480053798309], ['tmax_2m4_1', 0.9954824569443925], ['tmax_2m5_1', 0.9948623637802105], ['tmp_2m_4_1', 0.9925508354894933], ['ulwrf_s3_1', 0.986475159576972], ['tmin_2m5_1', 0.9863630955598195], ['tmp_2m_2_1', 0.9856921834337115], ['tmp_sfc3_1', 0.9856020769129077], ['tmp_sfc2_1', 0.9853721430018374], ['ulwrf_s5_1', 0.9850287211532148], ['tmax_2m2_1', 0.9829472636912582], ['tmp_2m_5_1', 0.9813875596975576], ['ulwrf_s4_1', 0.9807390701133252], ['tmp_sfc4_1', 0.9796642142509698], ['tmp_sfc5_1', 0.9758811197330793], ['ulwrf_s2_1', 0.9631310421561378], ['tmin_2m3_1', 0.9537926818714694], ['tmin_2m2_1', 0.9516543862813964]], [['Columna: tmp_2m_4_1'], ['tmax_2m4_1', 0.9989670084606509], ['tmax_2m5_1', 0.9988753498975144], ['tmin_2m4_1', 0.9955546180266533], ['ulwrf_s5_1', 0.9926662202081515], ['tmp_2m_3_1', 0.9925508354894933], ['tmp_sfc4_1', 0.9924835965861123], ['tmax_2m3_1', 0.9902875444682745], ['tmin_2m5_1', 0.9895327574093381], ['ulwrf_s4_1', 0.9875947598774877], ['tmp_2m_5_1', 0.9871404965220815], ['tmp_sfc3_1', 0.9838690005793208], ['tmp_sfc5_1', 0.9781934487046426], ['ulwrf_s3_1', 0.9712226930987012], ['tmp_sfc2_1', 0.9684774894714101], ['tmp_2m_2_1', 0.9679046544965666], ['tmax_2m2_1', 0.9645870257681579]], [['Columna: tmp_2m_5_1'], ['tmin_2m5_1', 0.9985248666465218], ['tmp_sfc5_1', 0.9980075740584928], ['tmax_2m5_1', 0.9872536923433799], ['tmp_2m_4_1', 0.9871404965220815], ['tmax_2m4_1', 0.9862698111992504], ['ulwrf_s5_1', 0.9852702329370912], ['tmin_2m4_1', 0.9836800034039297], ['tmp_2m_3_1', 0.9813875596975576], ['tmax_2m3_1', 0.9800734262132967], ['tmp_sfc4_1', 0.9781233238955636], ['ulwrf_s3_1', 0.9731828651939985], ['tmp_sfc2_1', 0.971798288488127], ['tmp_2m_2_1', 0.9702697100229983], ['ulwrf_s4_1', 0.9695091232286572], ['tmax_2m2_1', 0.9675902215544175], ['tmp_sfc3_1', 0.9646544198978636], ['ulwrf_s2_1', 0.9560882131646383]], [['Columna: tmp_sfc1_1'], ['tmp_2m_1_1', 0.9979448017549044], ['tmin_2m1_1', 0.9963142972792025], ['tmin_2m2_1', 0.9960749292152973], ['tmin_2m3_1', 0.9951903927420194], ['ulwrf_s1_1', 0.9919281097281679], ['ulwrf_s2_1', 0.9842676480547997], ['tmax_2m1_1', 0.9826171735169642], ['tmax_2m2_1', 0.9697293778574791], ['tmp_2m_2_1', 0.9668200402557023], ['ulwrf_s3_1', 0.9621315029521358], ['tmp_sfc2_1', 0.957188784436685]], [['Columna: tmp_sfc2_1'], ['tmp_2m_2_1', 0.9971658992806404], ['ulwrf_s3_1', 0.996855293888584], ['tmax_2m2_1', 0.9963595768447152], ['ulwrf_s2_1', 0.9879322271553737], ['tmax_2m3_1', 0.9874628996347063], ['tmp_2m_3_1', 0.9853721430018374], ['tmin_2m4_1', 0.9818609853360799], ['tmin_2m5_1', 0.975307217691161], ['tmp_sfc5_1', 0.9740593607013978], ['tmax_2m4_1', 0.9734162968894661], ['tmax_2m5_1', 0.9732760733014082], ['tmp_2m_5_1', 0.971798288488127], ['tmin_2m3_1', 0.9715874768684567], ['tmin_2m2_1', 0.9709866438853475], ['tmp_2m_4_1', 0.9684774894714101], ['ulwrf_s5_1', 0.9670339151339817], ['tmp_sfc3_1', 0.9649823936827109], ['tmp_2m_1_1', 0.9638593940360212], ['tmin_2m1_1', 0.9605896048258724], ['ulwrf_s4_1', 0.9572236269499521], ['tmp_sfc1_1', 0.957188784436685], ['tmp_sfc4_1', 0.9555363809053586], ['tmax_2m1_1', 0.9528200105400535]], [['Columna: tmp_sfc3_1'], ['ulwrf_s4_1', 0.9947627885687921], ['ulwrf_s5_1', 0.9892581590922543], ['tmp_sfc4_1', 0.9884745475700606], ['tmin_2m4_1', 0.985897294709102], ['tmax_2m4_1', 0.9857825429301692], ['tmp_2m_3_1', 0.9856020769129077], ['tmax_2m5_1', 0.9848432369308402], ['tmp_2m_4_1', 0.9838690005793208], ['tmax_2m3_1', 0.9836268315982332], ['tmin_2m5_1', 0.9704219757785946], ['ulwrf_s3_1', 0.9698096931439735], ['tmp_sfc2_1', 0.9649823936827109], ['tmp_2m_5_1', 0.9646544198978636], ['tmp_2m_2_1', 0.9570765900532514], ['tmp_sfc5_1', 0.9562755066656142], ['tmax_2m2_1', 0.9540154794130064]], [['Columna: tmp_sfc4_1'], ['ulwrf_s5_1', 0.996612200039398], ['ulwrf_s4_1', 0.9957411309514121], ['tmp_2m_4_1', 0.9924835965861123], ['tmax_2m4_1', 0.9903005164322048], ['tmax_2m5_1', 0.9902303273948138], ['tmp_sfc3_1', 0.9884745475700606], ['tmin_2m4_1', 0.9839686317112281], ['tmin_2m5_1', 0.9799245981499726], ['tmp_2m_3_1', 0.9796642142509698], ['tmp_2m_5_1', 0.9781233238955636], ['tmax_2m3_1', 0.977121003001589], ['tmp_sfc5_1', 0.9687614236330414], ['ulwrf_s3_1', 0.9604279752910915], ['tmp_sfc2_1', 0.9555363809053586]], [['Columna: tmp_sfc5_1'], ['tmp_2m_5_1', 0.9980075740584928], ['tmin_2m5_1', 0.9960054258735697], ['tmax_2m5_1', 0.9791492889290042], ['ulwrf_s5_1', 0.9789322162596626], ['tmp_2m_4_1', 0.9781934487046426], ['tmax_2m4_1', 0.9780961580272722], ['tmin_2m4_1', 0.9773521584222015], ['tmp_2m_3_1', 0.9758811197330793], ['tmax_2m3_1', 0.9751782371847001], ['ulwrf_s3_1', 0.9744478191517459], ['tmp_sfc2_1', 0.9740593607013978], ['tmp_2m_2_1', 0.9714528776518785], ['tmax_2m2_1', 0.9692885564165558], ['tmp_sfc4_1', 0.9687614236330414], ['ulwrf_s2_1', 0.9619766387479004], ['ulwrf_s4_1', 0.960457161005563], ['tmp_sfc3_1', 0.9562755066656142]], [['Columna: ulwrf_s1_1'], ['tmin_2m1_1', 0.9926165091998508], ['tmax_2m1_1', 0.9925465923917536], ['tmp_sfc1_1', 0.9919281097281679], ['tmp_2m_1_1', 0.9897625110599417], ['tmin_2m2_1', 0.9850580289147287], ['tmin_2m3_1', 0.9843176374301973], ['ulwrf_s2_1', 0.9762674995860944], ['tmax_2m2_1', 0.9606784450210313], ['ulwrf_s3_1', 0.9574576992187499], ['tmp_2m_2_1', 0.9573813147342563]], [['Columna: ulwrf_s2_1'], ['tmax_2m2_1', 0.9917542607726626], ['tmp_2m_2_1', 0.990609269111517], ['ulwrf_s3_1', 0.990564307766117], ['tmin_2m2_1', 0.9899721168841563], ['tmin_2m3_1', 0.9897318251363909], ['tmp_sfc2_1', 0.9879322271553737], ['tmp_2m_1_1', 0.9853437253651708], ['tmp_sfc1_1', 0.9842676480547997], ['tmin_2m1_1', 0.9814254257054185], ['ulwrf_s1_1', 0.9762674995860944], ['tmax_2m1_1', 0.9707221719806952], ['tmax_2m3_1', 0.966905126635995], ['tmp_2m_3_1', 0.9631310421561378], ['tmp_sfc5_1', 0.9619766387479004], ['tmin_2m4_1', 0.9587402415390521], ['tmin_2m5_1', 0.9584808725553905], ['tmp_2m_5_1', 0.9560882131646383]], [['Columna: ulwrf_s3_1'], ['tmp_sfc2_1', 0.996855293888584], ['tmp_2m_2_1', 0.994870585784455], ['tmax_2m2_1', 0.9943559569788851], ['ulwrf_s2_1', 0.990564307766117], ['tmax_2m3_1', 0.9882838798756105], ['tmp_2m_3_1', 0.986475159576972], ['tmin_2m4_1', 0.9834305304179148], ['tmin_2m5_1', 0.9771636437810954], ['tmax_2m4_1', 0.975826585247475], ['tmax_2m5_1', 0.975536953838263], ['tmin_2m3_1', 0.9751256747221002], ['tmp_sfc5_1', 0.9744478191517459], ['tmin_2m2_1', 0.9743529570168544], ['ulwrf_s5_1', 0.9734148064192735], ['tmp_2m_5_1', 0.9731828651939985], ['tmp_2m_4_1', 0.9712226930987012], ['tmp_sfc3_1', 0.9698096931439735], ['tmp_2m_1_1', 0.968308266178652], ['tmin_2m1_1', 0.9654648681213764], ['ulwrf_s4_1', 0.9651706956885256], ['tmp_sfc1_1', 0.9621315029521358], ['tmp_sfc4_1', 0.9604279752910915], ['tmax_2m1_1', 0.9578768560091268], ['ulwrf_s1_1', 0.9574576992187499]], [['Columna: ulwrf_s4_1'], ['ulwrf_s5_1', 0.9963430558611763], ['tmp_sfc4_1', 0.9957411309514121], ['tmp_sfc3_1', 0.9947627885687921], ['tmp_2m_4_1', 0.9875947598774877], ['tmax_2m4_1', 0.987077087459209], ['tmax_2m5_1', 0.9863521279974694], ['tmin_2m4_1', 0.9837254094808376], ['tmp_2m_3_1', 0.9807390701133252], ['tmax_2m3_1', 0.9784038012441758], ['tmin_2m5_1', 0.9737488476761204], ['tmp_2m_5_1', 0.9695091232286572], ['ulwrf_s3_1', 0.9651706956885256], ['tmp_sfc5_1', 0.960457161005563], ['tmp_sfc2_1', 0.9572236269499521]], [['Columna: ulwrf_s5_1'], ['tmp_sfc4_1', 0.996612200039398], ['ulwrf_s4_1', 0.9963430558611763], ['tmp_2m_4_1', 0.9926662202081515], ['tmax_2m4_1', 0.9915793553887863], ['tmax_2m5_1', 0.99151293872598], ['tmp_sfc3_1', 0.9892581590922543], ['tmin_2m4_1', 0.9881522932211056], ['tmin_2m5_1', 0.9873222874147335], ['tmp_2m_5_1', 0.9852702329370912], ['tmp_2m_3_1', 0.9850287211532148], ['tmax_2m3_1', 0.9831073628529786], ['tmp_sfc5_1', 0.9789322162596626], ['ulwrf_s3_1', 0.9734148064192735], ['tmp_sfc2_1', 0.9670339151339817], ['tmp_2m_2_1', 0.960980140611558], ['tmax_2m2_1', 0.9582268665523904]], [['Columna: ulwrf_t1_1']], [['Columna: ulwrf_t2_1'], ['ulwrf_t3_1', 0.9744666921198298]], [['Columna: ulwrf_t3_1'], ['ulwrf_t2_1', 0.9744666921198298]], [['Columna: ulwrf_t4_1'], ['ulwrf_t5_1', 0.9755542908956468]], [['Columna: ulwrf_t5_1'], ['ulwrf_t4_1', 0.9755542908956468]], [['Columna: uswrf_s1_1']], [['Columna: uswrf_s2_1'], ['dswrf_s2_1', 0.9911709851006711], ['dswrf_s3_1', 0.9591814530708258]], [['Columna: uswrf_s3_1']], [['Columna: uswrf_s4_1'], ['uswrf_s5_1', 0.9562280634672189]], [['Columna: uswrf_s5_1'], ['uswrf_s4_1', 0.9562280634672189]], [['Columna: salida']]]
""" seaborne Correlation Heat Map """
# It needs to show all the columns
fig, ax = plt.subplots(figsize=(19, 18))
plt.title("Correlation Heat Map", y=1)
# We use blue color scale because it is easier to see the annotations and the correlation values
sns.heatmap(
correlation,
square=True,
cmap="Blues",
annot=True,
fmt=".2f",
annot_kws={"size": 4},
cbar_kws={"shrink": 0.5},
vmin=0.0,
vmax=1,
)
# We can modify vmax=0.95 in order to get same color scale for values with more than 0.95 correlation
# Note: it delays around 15 seconds as it needs to plot a 76*76 matrix with its 5766 correlation values
# Exporting image as png to ../data/img folder - easier to visualize the annotations, better resolution
plt.savefig("../data/img/correlation_heatmap.png", dpi=200)
We can observe clearly how there are a lot of correlations between the different attributes, which is expected as they are all weather related variables.
This is important to know as it will allow us to know which attributes are redundant and which ones are not, so that we can delete the redundant ones in order to improve the model.
Once obtained the most correlated columns of the dataset, we can plot them and visualize their correlation.
# 1
columns = ['apcp_sf1_1', 'apcp_sf2_1', 'apcp_sf3_1','apcp_sf4_1', 'apcp_sf5_1']
sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
# 2
columns = [ 'dlwrf_s1_1', 'dlwrf_s2_1', 'dlwrf_s3_1', 'dlwrf_s4_1', 'dlwrf_s5_1']
sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
# 3
columns = ['pwat_ea1_1', 'pwat_ea2_1','pwat_ea3_1','pwat_ea4_1','pwat_ea5_1', 'dlwrf_s1_1', 'dlwrf_s2_1', 'dlwrf_s3_1', 'dlwrf_s4_1', 'dlwrf_s5_1']
sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
# 4
columns = ['dswrf_s1_1', 'dswrf_s2_1', 'dswrf_s3_1', 'dswrf_s4_1', 'dswrf_s5_1']
sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
# 5
columns = ['dswrf_s1_1', 'dswrf_s2_1', 'dswrf_s3_1', 'dswrf_s4_1', 'dswrf_s5_1', 'uswrf_s1_1', 'uswrf_s2_1', 'uswrf_s3_1', 'uswrf_s4_1', 'uswrf_s5_1']
sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
# 6
columns = ['pres_ms1_1', 'pres_ms2_1', 'pres_ms3_1', 'pres_ms4_1', 'pres_ms5_1']
sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
# 7
columns = ['pwat_ea1_1', 'pwat_ea2_1','pwat_ea3_1','pwat_ea4_1','pwat_ea5_1', 'spfh_2m1_1', 'spfh_2m2_1', 'spfh_2m3_1', 'spfh_2m4_1', 'spfh_2m5_1']
sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
# 8
columns = ['spfh_2m1_1', 'spfh_2m2_1', 'spfh_2m3_1', 'spfh_2m4_1', 'spfh_2m5_1','ulwrf_s1_1', 'ulwrf_s2_1', 'ulwrf_s3_1', 'ulwrf_s4_1', 'ulwrf_s5_1']
sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
# 9
columns = ['tmax_2m1_1', 'tmax_2m2_1', 'tmax_2m3_1', 'tmax_2m4_1', 'tmax_2m5_1', 'tmin_2m1_1', 'tmin_2m2_1', 'tmin_2m3_1', 'tmin_2m4_1', 'tmin_2m5_1','tmp_2m_1_1', 'tmp_2m_2_1', 'tmp_2m_3_1', 'tmp_2m_4_1', 'tmp_2m_5_1', 'tmp_sfc1_1', 'tmp_sfc2_1', 'tmp_sfc3_1', 'tmp_sfc4_1', 'tmp_sfc5_1','ulwrf_s1_1', 'ulwrf_s2_1', 'ulwrf_s3_1', 'ulwrf_s4_1', 'ulwrf_s5_1']
sns.pairplot(train_df[columns], height = 1 ,kind ='scatter',diag_kind='kde')
plt.show()
# 10
columns = ["ulwrf_t1_1", "ulwrf_t2_1", "ulwrf_t3_1"]
sns.pairplot(train_df[columns], height=1, kind="scatter", diag_kind="kde")
plt.show()
# 11
columns = ['ulwrf_t4_1', 'ulwrf_t5_1', ]
sns.pairplot(train_df[columns], height = 1 ,kind ='scatter',diag_kind='kde')
plt.show()
# 12
columns = ["uswrf_s2_1", "uswrf_s3_1", "uswrf_s4_1", "uswrf_s5_1"]
sns.pairplot(train_df[columns], height=1, kind="scatter", diag_kind="kde")
plt.show()
In the graphs above, we can observe that the most correlated variables exhibit a linear (and non linear) relationship between them and with the output. This is evident in the diagonal pattern in the graph, indicating that both variables increase or decrease together.
As we have previously mentioned, this is expected as the variables are all weather-related, such as radiative waves, rain, and clouds. It is normal for them to exhibit correlation at different times of the day within the same day, and it is important to consider this when creating the model and eliminating redundant variables as highly correlated variables provide redundant information and can negatively impact model performance. By identifying and removing redundant variables, the model becomes more focused, interpretable, and less prone to overfitting.
Since we are working with a time dependent data, we need to avoid mixing it. Also, we are required to add the first 10 years of data to the train set and the last 2 years to the test set. This means we are assigning a 83.333333 percent of the data to train and a 16.66666666 to test.
Note: This division was already done before the EDA. We overwrite it to start from a clean state.
Note: iloc is useful when we want to split data based on the index or other criteria, while train_test_split is useful when wanting to randomly split data into training and testing subsets.
Therefore, we will use iloc to split the data into train and test sets as we are dealing with time dependent data.
import time
import matplotlib.pyplot as plt
# Import the metrics from sklearn
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.pipeline import Pipeline
# As we have noted during the EDA, for this dataset full of outliers, its preferable to use the RobustScaler
# Although this wont make a huge difference
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV, GridSearchCV
""" Train Test Split (time series) """
np.random.seed(10)
# * Make a copy of the dataframe (as Padas dataframe is mutable, therefore uses a reference)
disp_df_copy = disp_df.copy()
# print(disp_df)
# print(disp_df_copy)
# Now we make the train_x, train_y, test_x, test_y splits taking into account the time series
# Note: the time series is ordered by date, therefore we need to split the data in a way that the train data is before the test data
# Note: the 10 first years are used for training and the last two years for testing
# Note: this is done because if not, we will be predicting the past from the future, which leads to errors and overfitting (data leakage) in the model
# * Calculate the number of rows for training and testing
num_rows = disp_df_copy.shape[0]
num_train_rows = int(
num_rows * 10 / 12
) # 10 first years for training, 2 last years for testing
# * Split the data into train and test dataframes (using iloc instead of train_test_split as it picks random rows)
train_df = disp_df_copy.iloc[
:num_train_rows, :
] # train contains the first 10 years of rows
test_df = disp_df_copy.iloc[
num_train_rows:, :
] # test contains the last 2 years of rows
# Print the number of rows for each dataframe
print(f"Number of rows for training: {train_df.shape[0]}")
print(f"Number of rows for testing: {test_df.shape[0]}")
# Print the dataframes
# print(train_df), print(test_df)
# * Separate the input features and target variable for training and testing
X_train = train_df.drop("salida", axis=1) # This is the input features for training
y_train = train_df["salida"] # This is the target variable for training
X_test = test_df.drop("salida", axis=1) # This is the input features for testing
y_test = test_df["salida"] # This is the target variable for testing
# We also make a simulation of the exact 5th fold (4 for training and 1 for testing from the training data)
num_rows_train = train_df.shape[0]
num_train_rows_train = int(num_rows_train * 4 / 5) # 4 folds for training, 1 fold for testing
train_5th_fold_train_df = train_df.iloc[
:num_train_rows_train, :
] # train_5th_fold_train contains the first 4 folds of rows
test_5th_fold_train_df = train_df.iloc[
num_train_rows_train:, :
] # test_5th_fold_train contains the last fold of rows
# * Separate the input features and target variable for training and testing
X_train_5th_fold_train = train_5th_fold_train_df.drop("salida", axis=1) # This is the input features for training
y_train_5th_fold_train = train_5th_fold_train_df["salida"] # This is the target variable for training
X_test_5th_fold_train = test_5th_fold_train_df.drop("salida", axis=1) # This is the input features for testing
y_test_5th_fold_train = test_5th_fold_train_df["salida"] # This is the target variable for testing
print(f"Number of rows for training in the 5th fold: {train_5th_fold_train_df.shape[0]}")
print(f"Number of rows for testing in the 5th fold: {test_5th_fold_train_df.shape[0]}")
# Print the shapes of the dataframes
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape, X_train_5th_fold_train.shape, y_train_5th_fold_train.shape, X_test_5th_fold_train.shape, y_test_5th_fold_train.shape)
Number of rows for training: 3650 Number of rows for testing: 730 Number of rows for training in the 5th fold: 2920 Number of rows for testing in the 5th fold: 730 (3650, 75) (3650,) (730, 75) (730,) (2920, 75) (2920,) (730, 75) (730,)
This function is used to get the MAE and RMSE values of the diffent models, therefore, it will also show the level of overfitting in the models. To perform this analysis, we compare the results of the training dataset from the first fold created by the time-series split, with the validation results of the same fold. Note that by using the training and validation sets, we avoid using the test set for any analysis, which is not recommended.
As with time split folds the largest one (more training data, as test is the same) is the fifth fold in our case, we will use it to compare the results of the different models and obtain the MAE and RMSE values. This way we can compare the results of the different models and see which one is the best. We could calculate and plot the MAE and RMSE values for each fold, but this would be time consuming and would not provide us with any additional information (we haves tested the different results with the different folds prior to this assumptions).
This way, we obtain a rather similar result in train and a relative aproximation of the test results, which is what we are looking for.
Note that we also add the posibility to make the predictions on test, but they will be only used in the final model, as we will see later.
np.random.seed(10)
def train_validation_test(m, model, score, X_train, y_train, test = False, X_test = None, y_test = None):
# Train
y_train_pred = model.predict(X_train)
rmse_train = mean_squared_error(y_train, y_train_pred, squared=False)
mae_train = mean_absolute_error(y_train, y_train_pred)
# Test
if test:
y_test_pred = model.predict(X_test)
rmse_test = mean_squared_error(y_test, y_test_pred, squared=False)
mae_test = mean_absolute_error(y_test, y_test_pred)
# We retrain the model with the partial training data (4 folds) and test it with the 5th fold
np.random.seed(10)
m.fit(X = X_train_5th_fold_train, y = y_train_5th_fold_train)
# Train in validation fold (5)
y_train_validation_pred = m.predict(X_train_5th_fold_train)
rmse_train_validation = mean_squared_error(y_train_5th_fold_train, y_train_validation_pred, squared=False)
mae_train_validation = mean_absolute_error(y_train_5th_fold_train, y_train_validation_pred)
# Test in validation fold (5)
y_test_validation_pred = m.predict(X_test_5th_fold_train)
rmse_test_validation = mean_squared_error(y_test_5th_fold_train, y_test_validation_pred, squared=False)
mae_test_validation = mean_absolute_error(y_test_5th_fold_train, y_test_validation_pred)
# ! Print results
print(f"Results of the best estimator of {model.__class__.__name__}")
print(f"NMAE in validation: {score:.2f}")
print(f"RMSE train: {rmse_train:.2f}", f"MAE train: {mae_train:.2f}", sep=" | ")
if test:
print(f"RMSE test: {rmse_test:.2f}", f"MAE test: {mae_test:.2f}", sep=" | ")
print(f"RMSE validation train: {rmse_train_validation:.2f}", f"MAE validation train: {mae_train_validation:.2f}", sep=" | ")
print(f"RMSE validation test: {rmse_test_validation:.2f}", f"MAE validation test: {mae_test_validation:.2f}", sep=" | ")
# ! Train
title = f'Prediction Errors (RMSE: {rmse_train:.2f}, MAE: {mae_train:.2f})'
scatterplot_histogram(X_train, y_train, y_train_pred, "Train", title)
# ! Test
if test:
title = f'Prediction Errors (RMSE: {rmse_test:.2f}, MAE: {mae_test:.2f})'
scatterplot_histogram(X_test, y_test, y_test_pred, "Test", title)
# ! Train in validation fold (5)
title = f'Prediction Errors (RMSE: {rmse_train_validation:.2f}, MAE: {mae_train_validation:.2f})'
scatterplot_histogram(X_train_5th_fold_train, y_train_5th_fold_train, y_train_validation_pred, "Train in validation", title)
# ! Test in validation fold (5)
title = f'Prediction Errors (RMSE: {rmse_test_validation:.2f}, MAE: {mae_test_validation:.2f})'
scatterplot_histogram(X_test_5th_fold_train, y_test_5th_fold_train, y_test_validation_pred, "Test in validation", title)
if test:
return [score, rmse_train, mae_train, rmse_train_validation, mae_train_validation, rmse_test_validation, mae_test_validation, rmse_test, mae_test,]
return [score, rmse_train, mae_train, rmse_train_validation, mae_train_validation, rmse_test_validation, mae_test_validation]
def scatterplot_histogram (X, y, y_pred, name, title):
# Make plots smaller to fit better on the notebook
plt.rcParams['figure.figsize'] = [5.5, 3.5]
# Train accuracy using scatter plot
plt.plot(X.iloc[:, [0]], y, ".", label=f"{name}")
plt.plot(X.iloc[:, [0]], y_pred, "r.", label=f"{name} prediction")
plt.title(f"{name} scatter plot")
plt.xlabel(f"{title}")
plt.legend()
plt.show()
# Calculate the difference between test predictions and test data
prediction_errors = y - y_pred
# Plot the distribution of prediction errors
plt.hist(prediction_errors, bins=25)
plt.xlabel(f'{title}')
plt.ylabel('Frequency')
plt.title(f"{name} prediction errors")
plt.show()
def print_results(name, model, score, time, test=False):
print("---------------------------------------------------")
print(f"{name} best model is:\n\n{model}")
print("\nParameters:", model.best_params_)
print(
f"\nPerformance: NMAE (val): {score[0]}",
f"RMSE train: {score[1]}",
f"MAE train: {score[2]}",
f"RMSE train in validation: {score[3]}",
f"MAE train in validation: {score[4]}",
f"RMSE test in validation: {score[5]}",
f"MAE test in validation: {score[6]}",
sep=" | ",
)
if test:
print(
f"RMSE test: {score[7]}",
f"MAE test: {score[8]}",
sep=" | ",
)
print(f"Execution time: {time}s")
We calculate the subsets used for training and testing in the different folds of the cross-validation.
Note: this function will not be used as we already made the fifth fold partition manually above (which is faster and does not need to be recomputed). However, it is useful to have it in case we want to use it in the future (for other folds, as it stores them all).
def validation_splits(model, X_train):
dict_folds = {}
for n_splits, (train_index, test_index) in enumerate(model.cv.split(X_train)):
index = "F" + str(n_splits + 1)
train_index_formatted = []
test_index_formatted = []
for i in range(len(train_index)):
train_index_formatted.append("V" + str(int(train_index[i] + 1)))
for i in range(len(test_index)):
test_index_formatted.append("V" + str(int(test_index[i] + 1)))
dict_folds[index] = [train_index_formatted, test_index_formatted]
return dict_folds
For each possible method we have created two different models; One with predefined parameters and the second one with selected parameters. For each model we create a pipeline which includes the escaler ( except for trees and related ) and the model. Note that we have selected RobustEscaler as our scaling method since we have found several outliers in the EDA. Secondly, we duplicate this two models per method and we add the selection of attributes. Note that the model with no selection of attributes and the one with selection of attributes have a double pipeline. Is a double pipeline since we use the output of the first pipeline ( best hiper-parameters ) directly into the second pipeline in order to avoid innecesary computing cost.
We have decided to train all models in the most similar way possible in order for the results to be comparable. This way, all models with selected parameters use RandomSearch in order to avoid unnecessary computational cost while still producing good results. Secondly, we have decided to use TimeSeriesSplit, which is a useful method when working with time-related data. We also perform a cross-validation within the parameter search in order to avoid optimistic scoring for some parameters. For all models, we are using a 5-fold cross-validation. We also decided to use NMAE as our method for testing error since it provides an easy-to-understand score and reduces the weight of outliers (as observed during the EDA process).
In addition note that in order to create the predefined models we are using gridsearch with just one option in the param-grid. This help us stay consistent in the way we create and compare models, since it provides a way of using cross-validation within the function.
During this section, we will analyze the performance of three methods: KNN, Regression Trees, and Linear Regression. For each method, we will provide a predefined model and another model with selected hyper-parameters. Our hypothesis is that the selected models will provide better performance, while the predefined ones will be better in terms of timing.
Please note that we will be using GridSearch with only one possibility (the predefined one) for the hyper-parameter to make it easier to create the predefined models. Additionally, we have decided to use RandomSearch for the selection of the parameters as it has been shown to provide good results with much less computing required.
# Three dictionaries to store the results of the models
models, results, times = {}, {}, {}
KNN (k-Nearest Neighbors) is a non-parametric algorithm used for classification and regression. It works by finding the k closest training examples in the feature space to a new input, and assigns the output value based on the majority class among the k neighbors in the case of classification or the average of the output values in the case of regression(our case). The value of k is a hyperparameter that must be chosen before training the model.
from sklearn.neighbors import KNeighborsRegressor
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[
("scale", RobustScaler()),
("model", KNeighborsRegressor()),
]
)
param_grid = {
"model__n_neighbors": [5],
"model__weights": ["uniform"],
"model__metric": ["minkowski"],
"model__algorithm": ["auto"],
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["KNN_pred"] = model
results["KNN_pred"] = score
times["KNN_pred"] = total_time
print_results("KNN PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -3239984.25 RMSE train: 3517654.38 | MAE train: 2493007.22 RMSE validation train: 3557480.48 | MAE validation train: 2518007.16 RMSE validation test: 4152140.06 | MAE validation test: 2892257.01
---------------------------------------------------
KNN PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scale', RobustScaler()),
('model', KNeighborsRegressor())]),
n_jobs=-1,
param_grid={'model__algorithm': ['auto'],
'model__metric': ['minkowski'],
'model__n_neighbors': [5],
'model__weights': ['uniform']},
scoring='neg_mean_absolute_error')
Parameters: {'model__algorithm': 'auto', 'model__metric': 'minkowski', 'model__n_neighbors': 5, 'model__weights': 'uniform'}
Performance: NMAE (val): -3239984.25 | RMSE train: 3517654.379918169 | MAE train: 2493007.2164383563 | RMSE train in validation: 3557480.484807456 | MAE train in validation: 2518007.157534247 | RMSE test in validation: 4152140.058048495 | MAE test in validation: 2892257.01369863
Execution time: 4.260819673538208s
np.random.seed(10)
n_splits = 5
# Using a pipeline to scale the data and then apply the model
pipeline = Pipeline(
[
("scale", RobustScaler()),
("select", SelectKBest(f_regression)),
("model", KNeighborsRegressor()),
]
)
param_grid = {
"model__n_neighbors": [5],
"model__weights": ["uniform"],
"model__metric": ["minkowski"],
"model__algorithm": ["auto"],
"select__k": list(range(1, X_train.shape[1] + 1)),
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["KNN_pred_k"] = model
results["KNN_pred_k"] = score
times["KNN_pred_k"] = total_time
print_results("KNN PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2690780.41 RMSE train: 3108869.52 | MAE train: 2162755.27 RMSE validation train: 3116226.54 | MAE validation train: 2171515.71 RMSE validation test: 3775814.09 | MAE validation test: 2560118.55
---------------------------------------------------
KNN PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scale', RobustScaler()),
('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model', KNeighborsRegressor())]),
n_jobs=-1,
param_grid={'model__algorithm': ['auto'],
'model__metric': ['minkowski'],
'model__n_neighbors': [5],
'model__weights': ['uniform'],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'model__algorithm': 'auto', 'model__metric': 'minkowski', 'model__n_neighbors': 5, 'model__weights': 'uniform', 'select__k': 6}
Performance: NMAE (val): -2690780.4078947366 | RMSE train: 3108869.5243311627 | MAE train: 2162755.2657534247 | RMSE train in validation: 3116226.5417679944 | MAE train in validation: 2171515.705479452 | RMSE test in validation: 3775814.085873258 | MAE test in validation: 2560118.5479452056
Execution time: 4.260819673538208s
Since the NMAE is normalized by the mean absolute error of the test set, it is expected to be different from the MAE calculated directly using the mean_absolute_error function. The NMAE is a way to evaluate the performance of a model in a cross-validation setting, while the MAE is a direct measure of the model's performance on the training set.
Therefore, as we can not use the results of RMSE nor MAE in test, we will use the NMAE scoring given in validation to select the best model (as it is a fairly correct aproximation).
As seen during the EDA, we have a lot of outliers in the dataset, so we will use a Robust Scaler to scale the data, as it is more robust to outliers than the Standard Scaler or the MinMax Scaler.
In order to make the process of comparing the Selected parameters with the Predefined parameters, we will create two different models, one for each set of parameters, created one from another with the best parameters found in the previous step and a pipeline with the preprocessing steps.
Note that KNN with the parameters it selects tend to overfit to the data in this dataset, as it can be seen in the different scatterplots and results. Moreover, we can make sure that it is kind of overfitting (apart from the fact that it has an score of 0 for both MAE and RMSE in train and train validation) as the result in the validation (5th fold) test is not near that good.
For this model, as explained in the introduction of this section, the main parameter to consider is the number of neighbors. Additionally, we have identified other relevant parameters that need to be chosen:
rmse = []
mae = []
rmse2 = []
mae2 = []
a_n_neighbors = range(1, 50, 2)
a_metric = ["euclidean", "manhattan", "minkowski", "chebyshev"]
for i in a_n_neighbors:
model = KNeighborsRegressor(n_neighbors=i)
model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
y_pred = model.predict(X_test_5th_fold_train)
rmse.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
mae.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
for i in a_metric:
model = KNeighborsRegressor(metric=i)
model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
y_pred = model.predict(X_test_5th_fold_train)
rmse2.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
mae2.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
# Crear dos subplots, uno para cada gráfico
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(8, 12))
# Graficar RMSE vs. n_neighbors en el primer subplot
ax1.plot(list(a_n_neighbors), rmse, label="RMSE")
ax1.set_xlabel("n_neighbors")
ax1.set_ylabel("RMSE")
ax1.set_title("Gráfica de RMSE")
# Graficar MAE vs. n_neighbors en el segundo subplot
ax2.plot(list(a_n_neighbors), mae, label="MAE")
ax2.set_xlabel("n_neighbors")
ax2.set_ylabel("MAE")
ax2.set_title("Gráfica de MAE")
# Graficar RMSE vs. metric en el tercer subplot
ax3.plot(a_metric, rmse2, label="RMSE")
ax3.set_xlabel("metric")
ax3.set_ylabel("RMSE")
ax3.set_title("Gráfica de RMSE")
# Graficar MAE vs. metric en el cuarto subplot
ax4.plot(a_metric, mae2, label="MAE")
ax4.set_xlabel("metric")
ax4.set_ylabel("MAE")
ax4.set_title("Gráfica de MAE")
plt.tight_layout()
plt.rcParams['figure.figsize'] = [10, 3]
plt.show()
np.random.seed(10)
budget = 75
n_splits = 5
pipeline = Pipeline(
[
("scaler", RobustScaler()),
("model", KNeighborsRegressor()),
]
)
param_grid = {
"model__n_neighbors": list(range(1, 50, 2)),
"model__weights": ["uniform", "distance"],
"model__metric": ["euclidean", "manhattan", "minkowski", "chebyshev"],
"model__algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
}
model = RandomizedSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(
n_splits
), # TimeSeriesSplit to split the data in folds without losing the temporal order
n_iter=budget,
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["KNN_select"] = model
results["KNN_select"] = score
times["KNN_select"] = total_time
print_results("KNN SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2880131.56 RMSE train: 0.00 | MAE train: 0.00 RMSE validation train: 0.00 | MAE validation train: 0.00 RMSE validation test: 3732609.98 | MAE validation test: 2587777.13
---------------------------------------------------
KNN SELECTED PARAMETERS best model is:
RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', RobustScaler()),
('model', KNeighborsRegressor())]),
n_iter=75, n_jobs=-1,
param_distributions={'model__algorithm': ['auto',
'ball_tree',
'kd_tree',
'brute'],
'model__metric': ['euclidean',
'manhattan',
'minkowski',
'chebyshev'],
'model__n_neighbors': [1, 3, 5, 7, 9,
11, 13, 15, 17,
19, 21, 23, 25,
27, 29, 31, 33,
35, 37, 39, 41,
43, 45, 47, 49],
'model__weights': ['uniform',
'distance']},
scoring='neg_mean_absolute_error')
Parameters: {'model__weights': 'distance', 'model__n_neighbors': 17, 'model__metric': 'manhattan', 'model__algorithm': 'kd_tree'}
Performance: NMAE (val): -2880131.5631625694 | RMSE train: 0.0 | MAE train: 0.0 | RMSE train in validation: 0.0 | MAE train in validation: 0.0 | RMSE test in validation: 3732609.9812009404 | MAE test in validation: 2587777.1287017944
Execution time: 6.565516233444214s
# Now, we will use the previously calculated best model to add the selection of attributes through the SelectKBest function in the pipeline
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[
("scaler", RobustScaler()),
("select", SelectKBest(f_regression)),
("model", KNeighborsRegressor()),
]
)
# Previous best model had as parameters: {'model__weights': 'distance', 'model__n_neighbors': 9, 'model__metric': 'manhattan'}
param_grid = {
"model__n_neighbors": [9],
"model__weights": ["distance"],
"model__metric": ["manhattan"],
"model__algorithm": ["kd_tree"],
"select__k": list(range(1, X_train.shape[1] + 1)),
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(
n_splits
),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["KNN_select_k"] = model
results["KNN_select_k"] = score
times["KNN_select_k"] = total_time
print_results("KNN SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2603870.87 RMSE train: 1355.34 | MAE train: 31.73 RMSE validation train: 25827.04 | MAE validation train: 675.92 RMSE validation test: 3681057.75 | MAE validation test: 2483096.38
---------------------------------------------------
KNN SELECTED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', RobustScaler()),
('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model', KNeighborsRegressor())]),
n_jobs=-1,
param_grid={'model__algorithm': ['kd_tree'],
'model__metric': ['manhattan'],
'model__n_neighbors': [9],
'model__weights': ['distance'],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'model__algorithm': 'kd_tree', 'model__metric': 'manhattan', 'model__n_neighbors': 9, 'model__weights': 'distance', 'select__k': 6}
Performance: NMAE (val): -2603870.865432223 | RMSE train: 1355.336484531192 | MAE train: 31.726027397260275 | RMSE train in validation: 25827.044900407618 | MAE train in validation: 675.9246575342465 | RMSE test in validation: 3681057.75211333 | MAE test in validation: 2483096.382277287
Execution time: 4.645997762680054s
Trees work by recursively partitioning the data into subsets based on the values of their features, creating a tree-like structure that maps each set of features to a predicted target value. Each node in the tree represents a feature, and each branch represents a decision rule based on the value of that feature. The goal is to split the data in a way that creates the most homogeneous subsets with respect to the target variable. Once the tree is constructed, it can be used to make predictions on new data by following the decision rules down the tree until a leaf node is reached, which contains the predicted target value.
Note: In trees (both regression trees and random forests), it is not necessary to scale the data, as the algorithm is not sensitive to the scale of the data.
from sklearn.tree import DecisionTreeRegressor
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[
("model", DecisionTreeRegressor(random_state=1)),
]
)
param_grid = {
"model__criterion": ["squared_error"],
"model__max_depth": [None],
"model__min_samples_split": [2],
"model__max_features": [None],
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["RegTrees_pred"] = model
results["RegTrees_pred"] = score
times["RegTrees_pred"] = total_time
print_results("REGRESSION TREES PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -3467149.44 RMSE train: 0.00 | MAE train: 0.00 RMSE validation train: 0.00 | MAE validation train: 0.00 RMSE validation test: 4961507.79 | MAE validation test: 3406755.21
---------------------------------------------------
REGRESSION TREES PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('model',
DecisionTreeRegressor(random_state=1))]),
n_jobs=-1,
param_grid={'model__criterion': ['squared_error'],
'model__max_depth': [None],
'model__max_features': [None],
'model__min_samples_split': [2]},
scoring='neg_mean_absolute_error')
Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2}
Performance: NMAE (val): -3467149.4407894737 | RMSE train: 0.0 | MAE train: 0.0 | RMSE train in validation: 0.0 | MAE train in validation: 0.0 | RMSE test in validation: 4961507.791413844 | MAE test in validation: 3406755.205479452
Execution time: 0.5705435276031494s
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[
('select', SelectKBest(f_regression)),
("model", DecisionTreeRegressor(random_state=1)),
]
)
param_grid = {
"model__criterion": ["squared_error"],
"model__max_depth": [None],
"model__min_samples_split": [2],
"model__max_features": [None],
"select__k": list(range(1, X_train.shape[1] + 1)),
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["RegTrees_pred_k"] = model
results["RegTrees_pred_k"] = score
times["RegTrees_pred_k"] = total_time
print_results("REGRESSION TREES PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -3328832.17 RMSE train: 0.00 | MAE train: 0.00 RMSE validation train: 0.00 | MAE validation train: 0.00 RMSE validation test: 5002502.82 | MAE validation test: 3460965.62
---------------------------------------------------
REGRESSION TREES PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model',
DecisionTreeRegressor(random_state=1))]),
n_jobs=-1,
param_grid={'model__criterion': ['squared_error'],
'model__max_depth': [None],
'model__max_features': [None],
'model__min_samples_split': [2],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'select__k': 9}
Performance: NMAE (val): -3328832.171052632 | RMSE train: 0.0 | MAE train: 0.0 | RMSE train in validation: 0.0 | MAE train in validation: 0.0 | RMSE test in validation: 5002502.819275869 | MAE test in validation: 3460965.616438356
Execution time: 3.333441972732544s
Note: As we can see, the default model is clearly overfitting, as indicated by the 0 error for the train section and a high error for the test section. This is likely due to the lack of control over the maximum depth of the tree, combined with a small minimum sample split that leaves only one sample in each leaf. This causes the model to memorize each data point, leading to poor generalization performance.
Building upon the previous definition, we can reduce the most important parameters to be ajusted to the following:
rmse = []
mae = []
rmse2 = []
mae2 = []
a_max_depth = range(5, 61, 5)
a_min_samples_split = range(5, 200)
for i in a_max_depth:
model = DecisionTreeRegressor(random_state=1, max_depth=i)
model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
y_pred = model.predict(X_test_5th_fold_train)
rmse.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
mae.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
for i in a_min_samples_split:
model = DecisionTreeRegressor(random_state=1, min_samples_split=i)
model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
y_pred = model.predict(X_test_5th_fold_train)
rmse2.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
mae2.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
# Crear dos subplots, uno para cada gráfico
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(8, 12))
# Graficar RMSE vs. n_neighbors en el primer subplot
ax1.plot(list(a_max_depth), rmse, label="RMSE")
ax1.set_xlabel("max_depth")
ax1.set_ylabel("RMSE")
ax1.set_title("Gráfica de RMSE")
# Graficar MAE vs. n_neighbors en el segundo subplot
ax2.plot(list(a_max_depth), mae, label="MAE")
ax2.set_xlabel("max_depth")
ax2.set_ylabel("MAE")
ax2.set_title("Gráfica de MAE")
# Graficar RMSE vs. metric en el tercer subplot
ax3.plot(list(a_min_samples_split), rmse2, label="RMSE")
ax3.set_xlabel("min_samples_split")
ax3.set_ylabel("RMSE")
ax3.set_title("Gráfica de RMSE")
# Graficar MAE vs. metric en el cuarto subplot
ax4.plot(list(a_min_samples_split), mae2, label="MAE")
ax4.set_xlabel("min_samples_split")
ax4.set_ylabel("MAE")
ax4.set_title("Gráfica de MAE")
plt.tight_layout()
plt.rcParams['figure.figsize'] = [10, 3]
plt.show()
np.random.seed(10)
budget = 75
n_splits = 5
pipeline = Pipeline(
[
("model", DecisionTreeRegressor(random_state=1))
]
)
param_grid = {
"model__criterion": ["absolute_error", "squared_error"],
"model__max_depth": list(range(5, 61, 5)),
"model__min_samples_split": list(range(5, 200)),
"model__max_features": ["sqrt", "log2", None],
}
# We use TimeSeriesSplit to split the data in folds without losing the temporal order
model = RandomizedSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_iter=budget,
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["RegTrees_select"] = model
results["RegTrees_select"] = score
times["RegTrees_select"] = total_time
print_results("REGRESSION TREES SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2743220.58 RMSE train: 3259190.45 | MAE train: 2080612.60 RMSE validation train: 3286556.31 | MAE validation train: 2092567.19 RMSE validation test: 3914582.59 | MAE validation test: 2655352.60
---------------------------------------------------
REGRESSION TREES SELECTED PARAMETERS best model is:
RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('model',
DecisionTreeRegressor(random_state=1))]),
n_iter=75, n_jobs=-1,
param_distributions={'model__criterion': ['absolute_error',
'squared_error'],
'model__max_depth': [5, 10, 15, 20, 25,
30, 35, 40, 45, 50,
55, 60],
'model__max_features': ['sqrt', 'log2',
None],
'model__min_samples_split': [5, 6, 7, 8,
9, 10, 11,
12, 13, 14,
15, 16, 17,
18, 19, 20,
21, 22, 23,
24, 25, 26,
27, 28, 29,
30, 31, 32,
33, 34, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'model__min_samples_split': 106, 'model__max_features': None, 'model__max_depth': 30, 'model__criterion': 'absolute_error'}
Performance: NMAE (val): -2743220.575657895 | RMSE train: 3259190.446254432 | MAE train: 2080612.602739726 | RMSE train in validation: 3286556.310045412 | MAE train in validation: 2092567.191780822 | RMSE test in validation: 3914582.5939823505 | MAE test in validation: 2655352.602739726
Execution time: 16.35970973968506s
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[
("select", SelectKBest(f_regression)),
("model", DecisionTreeRegressor(random_state=1))
]
)
# Previous model Parameters: {'model__min_samples_split': 106, 'model__max_features': None, 'model__max_depth': 30, 'model__criterion': 'absolute_error'}
param_grid = {
"model__criterion": ["absolute_error"],
"model__max_depth": [30],
"model__min_samples_split": [106],
"model__max_features": [None],
"select__k": list(range(1, X_train.shape[1] + 1)),
}
# We use TimeSeriesSplit to split the data in folds without losing the temporal order
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["RegTrees_select_k"] = model
results["RegTrees_select_k"] = score
times["RegTrees_select_k"] = total_time
print_results("REGRESSION TREES SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2727416.15 RMSE train: 3452866.62 | MAE train: 2199234.33 RMSE validation train: 3561457.96 | MAE validation train: 2280089.28 RMSE validation test: 4044668.04 | MAE validation test: 2710957.60
---------------------------------------------------
REGRESSION TREES SELECTED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model',
DecisionTreeRegressor(random_state=1))]),
n_jobs=-1,
param_grid={'model__criterion': ['absolute_error'],
'model__max_depth': [30],
'model__max_features': [None],
'model__min_samples_split': [106],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'model__criterion': 'absolute_error', 'model__max_depth': 30, 'model__max_features': None, 'model__min_samples_split': 106, 'select__k': 4}
Performance: NMAE (val): -2727416.151315789 | RMSE train: 3452866.617242818 | MAE train: 2199234.328767123 | RMSE train in validation: 3561457.960699349 | MAE train in validation: 2280089.2808219176 | RMSE test in validation: 4044668.035092536 | MAE test in validation: 2710957.602739726
Execution time: 27.90829086303711s
Linear regression is a supervised learning algorithm that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The goal is to find the best fit line that can predict the dependent variable given the independent variables.
For the selected model we will be considering Lasso and Ridge. Lasso and Ridge regression are two popular regularization techniques used with linear regression. Lasso adds a penalty term to the regression equation that encourages the model to minimize the absolute value of the regression coefficients, which can lead to some coefficients being exactly zero. Ridge regression, on the other hand, adds a penalty term that encourages the model to minimize the square of the regression coefficients, which can help prevent overfitting. These techniques can improve the performance of the linear regression model by reducing the impact of irrelevant or highly correlated features.
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
np.random.seed(10)
n_splits = 5
pipeline = Pipeline([("scaler", RobustScaler()), ("model", LinearRegression())])
param_grid = {
"model__fit_intercept": [True],
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["LinearReg_pred"] = model
results["LinearReg_pred"] = score
times["LinearReg_pred"] = total_time
print_results("LINEAR REGRESSION PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2437056.06 RMSE train: 3254352.60 | MAE train: 2321647.06 RMSE validation train: 3265297.88 | MAE validation train: 2322380.61 RMSE validation test: 3268115.48 | MAE validation test: 2265683.80
---------------------------------------------------
LINEAR REGRESSION PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', RobustScaler()),
('model', LinearRegression())]),
n_jobs=-1, param_grid={'model__fit_intercept': [True]},
scoring='neg_mean_absolute_error')
Parameters: {'model__fit_intercept': True}
Performance: NMAE (val): -2437056.0592061607 | RMSE train: 3254352.603690468 | MAE train: 2321647.0597032406 | RMSE train in validation: 3265297.879240584 | MAE train in validation: 2322380.6106294743 | RMSE test in validation: 3268115.4760430153 | MAE test in validation: 2265683.802964292
Execution time: 0.2956578731536865s
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[
("scaler", RobustScaler()),
("select", SelectKBest(f_regression)),
("model", LinearRegression()),
]
)
param_grid = {
"model__fit_intercept": [True],
"select__k": list(range(1, X_train.shape[1] + 1)),
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["LinearReg_pred_k"] = model
results["LinearReg_pred_k"] = score
times["LinearReg_pred_k"] = total_time
print_results("LINEAR REGRESSION PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2421796.65 RMSE train: 3256574.00 | MAE train: 2323171.61 RMSE validation train: 3267629.55 | MAE validation train: 2322601.75 RMSE validation test: 3267567.88 | MAE validation test: 2263068.40
---------------------------------------------------
LINEAR REGRESSION PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', RobustScaler()),
('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model', LinearRegression())]),
n_jobs=-1,
param_grid={'model__fit_intercept': [True],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'model__fit_intercept': True, 'select__k': 72}
Performance: NMAE (val): -2421796.652193799 | RMSE train: 3256573.9989301027 | MAE train: 2323171.6092511206 | RMSE train in validation: 3267629.5529683903 | MAE train in validation: 2322601.753096195 | RMSE test in validation: 3267567.87998712 | MAE test in validation: 2263068.4012916926
Execution time: 2.2786672115325928s
Expanding upon the previous discussion, when using Lasso regression, we can focus on adjusting the following key parameters:
It's worth noting that these are just a few of the many parameters that can be adjusted when using Lasso regression. However, by focusing on these key parameters, we can gain a better understanding of how the model works and how to optimize its performance.
We can reduce the most important parameters to be adjusted for Ridge regression to the following:
Similarly, we can reduce the most important parameters for Elastic Net regression to be adjusted to the following:
Adjusting these parameters can help prevent overfitting and improve the performance of the Elastic Net regression model.
Note: due to scikit learn internal implementation of the Elastic Net model, we can obtain Lasso model by setting l1_ratio to 1 but not Ridge model by setting l1_ratio to 0, which is strange. This results in a poor performance of Elastic Net model, as for this dataset, Ridge model is way more efficient than Lasso model.
This could be occurring due to the dataset itself (outliers, correlations...), the dataset handling, the library version, the dependencies, the python version, or the virtual environment.
rmse = []
mae = []
rmse2 = []
mae2 = []
a_alpha = np.logspace(-2, 5, 75)
for i in a_alpha:
model = Lasso(fit_intercept=True, tol=0.5, random_state=10, alpha = i)
model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
y_pred = model.predict(X_test_5th_fold_train)
rmse.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
mae.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
# Crear dos subplots, uno para cada gráfico
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 12))
# Graficar RMSE vs. n_neighbors en el primer subplot
ax1.plot(list(a_alpha), rmse, label="RMSE")
ax1.set_xlabel("alpha")
ax1.set_ylabel("RMSE")
ax1.set_title("Gráfica de RMSE")
# Graficar MAE vs. n_neighbors en el segundo subplot
ax2.plot(list(a_alpha), mae, label="MAE")
ax2.set_xlabel("alpha")
ax2.set_ylabel("MAE")
ax2.set_title("Gráfica de MAE")
plt.tight_layout()
plt.rcParams['figure.figsize'] = [10, 3]
plt.show()
np.random.seed(10)
budget = 75
n_splits = 5
all_scores = []
# ! Pipelines
pipeline_lasso = Pipeline(
[
("scaler", RobustScaler()),
("model", Lasso(fit_intercept=True, tol=0.5, random_state=10)),
]
)
pipeline_ridge = Pipeline(
[
("scaler", RobustScaler()),
("model", Ridge(fit_intercept=True, random_state=10)),
]
)
pipeline_elastic = Pipeline(
[
("scaler", RobustScaler()),
("model", ElasticNet(fit_intercept=True, tol=0.5, random_state=10)),
]
)
# ! Parameter grids
param_grid_lasso = {
"model__alpha": np.logspace(-2, 5, 75), # Between 0.001 and 100000
}
param_grid_ridge = {
"model__alpha": np.logspace(-2, 1, 75), # Between 0.001 and 10
}
param_grid_elastic = {
"model__alpha": np.logspace(-2, 5, 75), # Between 0.001 and 10
"model__l1_ratio": np.linspace(0, 1, 75), # Between 0 and 1
}
# ! If we want to use random values for the parameters -> unconsistency in the results
regr_lasso = RandomizedSearchCV(
pipeline_lasso,
param_grid_lasso,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(),
n_iter=budget,
n_jobs=-1,
)
regr_ridge = RandomizedSearchCV(
pipeline_ridge,
param_grid_ridge,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(),
n_iter=budget,
n_jobs=-1,
)
regr_elastic = RandomizedSearchCV(
pipeline_elastic,
param_grid_elastic,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(),
n_iter=budget,
n_jobs=-1,
)
model = [regr_lasso, regr_ridge, regr_elastic]
ln_reg_time, scoring = [], []
for i in model:
start_time = time.time()
i.fit(X=X_train, y=y_train)
print(f"Model: {i.best_score_}")
print(i.best_params_)
# Now we reevaluate the model on the test set to obtain more accurate results
# Calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_folds = validation_splits(i, X_train)
scoring.append(i.best_score_)
all_scores.append(
train_validation_test(
i,
i.best_estimator_,
i.best_score_,
X_train,
y_train,
)
)
ln_reg_time.append(time.time() - start_time)
print(ln_reg_time)
# Select the best model (based on the MAE)
max_score = min(
all_scores, key=lambda x: abs(x[0])
) # Best model is the one that minimizes the validation NMAE
best_model = model[all_scores.index(max_score)]
total_time = ln_reg_time[all_scores.index(max_score)]
models["LinearReg_select"] = best_model
results["LinearReg_select"] = max_score
times["LinearReg_select"] = total_time
# Print results
print_results("LINEAR REGRESSION SELECTED PARAMETERS", best_model, score, total_time)
Model: -2916665.224197623
{'model__alpha': 52025.49442372698}
Results of the best estimator of Pipeline
NMAE in validation: -2916665.22
RMSE train: 3876831.69 | MAE train: 2906173.48
RMSE validation train: 3922101.26 | MAE validation train: 2937813.23
RMSE validation test: 3923699.07 | MAE validation test: 2858386.71
Model: -2396352.0117066414
{'model__alpha': 0.9693631061142517}
Results of the best estimator of Pipeline
NMAE in validation: -2396352.01
RMSE train: 3276534.92 | MAE train: 2333075.68
RMSE validation train: 3292824.95 | MAE validation train: 2337932.29
RMSE validation test: 3280253.30 | MAE validation test: 2260087.83
Model: -2658838.747674453
{'model__l1_ratio': 0.43243243243243246, 'model__alpha': 0.26237286577779917}
Results of the best estimator of Pipeline
NMAE in validation: -2658838.75
RMSE train: 3592222.21 | MAE train: 2660152.84
RMSE validation train: 3604425.08 | MAE validation train: 2661450.93
RMSE validation test: 3576311.08 | MAE validation test: 2558163.00
[4.613823652267456, 4.527256488800049, 4.600627183914185]
---------------------------------------------------
LINEAR REGRESSION SELECTED PARAMETERS best model is:
RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', RobustScaler()),
('model',
Ridge(random_state=10))]),
n_iter=75, n_jobs=-1,
param_distributions={'model__alpha': array([ 0.01 , 0.01097844, 0.01205261, 0.01323188, 0.01452654,
0.01594787, 0.01750827, 0.01922135, 0.02110203, 0.02316674,
0.025...
0.66730492, 0.73259654, 0.80427655, 0.88297 , 0.96936311,
1.06420924, 1.16833549, 1.28264983, 1.40814912, 1.54592774,
1.69718713, 1.86324631, 2.04555335, 2.245698 , 2.46542555,
2.70665207, 2.9714811 , 3.26222201, 3.5814101 , 3.93182876,
4.31653369, 4.73887961, 5.20254944, 5.71158648, 6.27042962,
6.88395207, 7.55750387, 8.29695852, 9.1087642 , 10. ])},
scoring='neg_mean_absolute_error')
Parameters: {'model__alpha': 1.0642092440647246}
Performance: NMAE (val): -2421796.652193799 | RMSE train: 3256573.9989301027 | MAE train: 2323171.6092511206 | RMSE train in validation: 3267629.5529683903 | MAE train in validation: 2322601.753096195 | RMSE test in validation: 3267567.87998712 | MAE test in validation: 2263068.4012916926
Execution time: 4.527256488800049s
np.random.seed(10)
n_splits = 5
# We use Ridge as model as it is the best performing one
pipeline = Pipeline(
[
("scaler", RobustScaler()),
("select", SelectKBest(f_regression)),
("model", Ridge(fit_intercept=True, random_state=10)),
]
)
# Previous model Parameters: {'model__alpha': 0.9693631061142517}
param_grid = {
"model__alpha": [0.9693631061142517],
"select__k": list(range(1, X_train.shape[1] + 1)),
}
# We use TimeSeriesSplit to split the data in folds without losing the temporal order
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["LinearReg_select_k"] = model
results["LinearReg_select_k"] = score
times["LinearReg_select_k"] = total_time
# Print results
print_results("LINEAR REGRESSION SELECTED PARAMETERS", best_model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2389586.49 RMSE train: 3278341.47 | MAE train: 2333541.31 RMSE validation train: 3293274.52 | MAE validation train: 2336194.58 RMSE validation test: 3278610.61 | MAE validation test: 2258218.41
---------------------------------------------------
LINEAR REGRESSION SELECTED PARAMETERS best model is:
RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', RobustScaler()),
('model',
Ridge(random_state=10))]),
n_iter=75, n_jobs=-1,
param_distributions={'model__alpha': array([ 0.01 , 0.01097844, 0.01205261, 0.01323188, 0.01452654,
0.01594787, 0.01750827, 0.01922135, 0.02110203, 0.02316674,
0.025...
0.66730492, 0.73259654, 0.80427655, 0.88297 , 0.96936311,
1.06420924, 1.16833549, 1.28264983, 1.40814912, 1.54592774,
1.69718713, 1.86324631, 2.04555335, 2.245698 , 2.46542555,
2.70665207, 2.9714811 , 3.26222201, 3.5814101 , 3.93182876,
4.31653369, 4.73887961, 5.20254944, 5.71158648, 6.27042962,
6.88395207, 7.55750387, 8.29695852, 9.1087642 , 10. ])},
scoring='neg_mean_absolute_error')
Parameters: {'model__alpha': 1.0642092440647246}
Performance: NMAE (val): -2389586.491181177 | RMSE train: 3278341.466529396 | MAE train: 2333541.305110323 | RMSE train in validation: 3293274.5203141714 | MAE train in validation: 2336194.5845998474 | RMSE test in validation: 3278610.608896576 | MAE test in validation: 2258218.4050652594
Execution time: 2.1258764266967773s
To be observed, the selected model, Ridge, does not delete any of the attributes (as expected, as it is one of its flaws), but some of their weights are close to zero, so we can consider that they are not relevant for the model.
On the other hand, the Lasso model and the ElasticNet model, do delete some of the attributes, but the results are worse than the Ridge model, so we will not consider them.
As in the other models, in order to be able to compare the different times and scores, we will divide the dummy regressor into two different models. The first model creates the model without selecting the attributes and the second uses the best parameters of the previous one and selects the attributes through another pipeline.
As strategy we selected "median" as we are dealing with (N)MAE as scoring in the other methods.
from sklearn.dummy import DummyRegressor
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[
("scaler", RobustScaler()),
("model", DummyRegressor()),
]
)
param_grid = {
'model__strategy': ['median'],
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["DummyReg"] = model
results["DummyReg"] = score
times["DummyReg"] = total_time
print_results("RANDOM FOREST PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -6953359.14 RMSE train: 8058570.05 | MAE train: 6899205.37 RMSE validation train: 8120616.17 | MAE validation train: 6944040.21 RMSE validation test: 7809144.90 | MAE validation test: 6720947.26
---------------------------------------------------
RANDOM FOREST PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', RobustScaler()),
('model', DummyRegressor())]),
n_jobs=-1, param_grid={'model__strategy': ['median']},
scoring='neg_mean_absolute_error')
Parameters: {'model__strategy': 'median'}
Performance: NMAE (val): -6953359.144736841 | RMSE train: 8058570.051086258 | MAE train: 6899205.369863014 | RMSE train in validation: 8120616.171716434 | MAE train in validation: 6944040.205479452 | RMSE test in validation: 7809144.902737563 | MAE test in validation: 6720947.2602739725
Execution time: 0.23421549797058105s
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[
("scaler", RobustScaler()),
("select", SelectKBest(f_regression)),
("model", DummyRegressor(strategy="median")),
]
)
# Previous model parameters: {'model__strategy': 'median'}
param_grid = {
'model__strategy': ['median'],
"select__k": list(range(1, X_train.shape[1] + 1)),
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["DummyReg_k"] = model
results["DummyReg_k"] = score
times["DummyReg_k"] = total_time
print_results("RANDOM FOREST PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -6953359.14 RMSE train: 8058570.05 | MAE train: 6899205.37 RMSE validation train: 8120616.17 | MAE validation train: 6944040.21 RMSE validation test: 7809144.90 | MAE validation test: 6720947.26
---------------------------------------------------
RANDOM FOREST PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', RobustScaler()),
('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model',
DummyRegressor(strategy='median'))]),
n_jobs=-1,
param_grid={'model__strategy': ['median'],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'model__strategy': 'median', 'select__k': 1}
Performance: NMAE (val): -6953359.144736841 | RMSE train: 8058570.051086258 | MAE train: 6899205.369863014 | RMSE train in validation: 8120616.171716434 | MAE train in validation: 6944040.205479452 | RMSE test in validation: 7809144.902737563 | MAE test in validation: 6720947.2602739725
Execution time: 1.6907029151916504s
As expected, the selection of attributes does not improve the results of the dummy regressor model and it is useless in terms of performance. This is due to the fact that the dummy regressor is a model that does not take into account the attributes, so it does not matter if we select them or not, so it selects just one (k=1).
First, we adjust the times by adding to the time of the selection of the attributes the time of training the real model. This is because the selection of the attributes is done after the training of the model, so it does not contain the time of the training of the model.
This is just an approximation to how much the selection of the attributes will last, as we are using two different models and pipelines, we are not able to calculate it directly.
# We store the partial times for future use
partial_times = times.copy()
# Real time adjustment (we add the time of the attribute selection to the time of the model real training)
print(times)
for key in times.keys():
# If even, we dont add the time of the attribute selection
# If odd, we add the time of the attribute selection
if list(times.keys()).index(key) % 2 != 0:
times[key] += times[key.replace("_k", "")]
print(times)
{'KNN_pred': 4.260819673538208, 'KNN_pred_k': 4.260819673538208, 'KNN_select': 6.565516233444214, 'KNN_select_k': 4.645997762680054, 'RegTrees_pred': 0.5705435276031494, 'RegTrees_pred_k': 3.333441972732544, 'RegTrees_select': 16.35970973968506, 'RegTrees_select_k': 27.90829086303711, 'LinearReg_pred': 0.2956578731536865, 'LinearReg_pred_k': 2.2786672115325928, 'LinearReg_select': 4.527256488800049, 'LinearReg_select_k': 2.1258764266967773, 'DummyReg': 0.23421549797058105, 'DummyReg_k': 1.6907029151916504}
{'KNN_pred': 4.260819673538208, 'KNN_pred_k': 8.521639347076416, 'KNN_select': 6.565516233444214, 'KNN_select_k': 11.211513996124268, 'RegTrees_pred': 0.5705435276031494, 'RegTrees_pred_k': 3.9039855003356934, 'RegTrees_select': 16.35970973968506, 'RegTrees_select_k': 44.26800060272217, 'LinearReg_pred': 0.2956578731536865, 'LinearReg_pred_k': 2.5743250846862793, 'LinearReg_select': 4.527256488800049, 'LinearReg_select_k': 6.653132915496826, 'DummyReg': 0.23421549797058105, 'DummyReg_k': 1.9249184131622314}
np.random.seed(10)
# ! Obtain best, worst, fastest and slowest model
max_score = max(results.values(), key=lambda x: abs(x[0])) # We use the scoring (NMAE) as explained above to select the best model
min_score = min(results.values(), key=lambda x: abs(x[0]))
# Obtain the key name of the best and worst model
max_time = max(times.values(), key=lambda x: x)
min_time = min(times.values(), key=lambda x: x)
best_model = list(results.keys())[list(results.values()).index(min_score)]
worst_model = list(results.keys())[list(results.values()).index(max_score)]
fastest_model = list(times.keys())[list(times.values()).index(min_time)]
slowest_model = list(times.keys())[list(times.values()).index(max_time)]
print(f"Best model: {best_model} with score (-NMAE) {abs(min_score[0])} and time {list(times.values())[list(results.values()).index(min_score)]}s")
print(f"Worst model: {worst_model} with score (-NMAE) {abs(max_score[0])} and time {list(times.values())[list(results.values()).index(max_score)]}s")
print(f"Fastest model: {fastest_model} with score (-NMAE) {abs(results[fastest_model][0])} and time {min_time}s")
print(f"Slowest model: {slowest_model} with score(-NMAE) {abs(results[slowest_model][0])} and time {max_time}s")
# ! Average (test MAE) score of the models
avg_score = 0
avg_time = 0
for key, value in results.items():
avg_score += results[key][0]
avg_time += times[key]
print(f"\nAverage models score: {abs(avg_score/len(results))}")
print(f"Average models time: {avg_time/len(times)}\n")
# ! Differences
print("The score difference between the best and worst model is: ", abs(max_score[0] - min_score)[0]) # Scoring evaluation -NMAE
print("The score difference between the best and fastest model is: ", abs(min_score[0] - abs(results[fastest_model][0]))) # Scoring evaluation -NMAE
print("The time difference between the best and fastest model model is: ", abs(list(times.values())[list(results.values()).index(min_score)] - min_time))
print("The time difference between the fastest and slowest model is: ", abs(max_time - min_time))
Best model: LinearReg_select_k with score (-NMAE) 2389586.491181177 and time 6.653132915496826s Worst model: DummyReg with score (-NMAE) 6953359.144736841 and time 0.23421549797058105s Fastest model: DummyReg with score (-NMAE) 6953359.144736841 and time 0.23421549797058105s Slowest model: RegTrees_select_k with score(-NMAE) 2727416.151315789 and time 44.26800060272217s Average models score: 3373778.2092190557 Average models time: 7.990802492414202 The score difference between the best and worst model is: 4563772.653555664 The score difference between the best and fastest model is: 9342945.635918017 The time difference between the best and fastest model model is: 6.418917417526245 The time difference between the fastest and slowest model is: 44.03378510475159
# Print the results up to now
plt.rcParams['figure.figsize'] = [10, 3.5]
# ! Plot the scores (NMAE in evaluation)
print("MODEL SCORES (NMAE in evaluation)")
iter = 0
for key, value in results.items():
plt.bar(key, abs(value[0]))
print(f"{iter}. {key}: {abs(value[0])}")
iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()
plt.xticks(rotation=45, ha='right', size=7)
# Exporting image as png to ../data/img folder
plt.savefig("../data/img/basic_methods_score.png")
plt.show()
# ! Plot the time (just the even ones == the ones that are not selectors of attributes)
print("MODEL TIMES (s)")
iter = 0
for key, value in times.items():
if iter % 2 == 0:
plt.bar(key, value)
print(f"{iter}. {key}: {value}")
iter += 1
plt.title("Time")
plt.xlabel("Model")
plt.ylabel("Time (s)")
plt.tight_layout()
plt.xticks(rotation=45, ha='right', size=7)
# Exporting image as png to ../data/img folder
plt.savefig("../data/img/basic_methods_time.png")
plt.show()
# ! Plot the time (just the odd ones == the selectors of attributes)
print("MODEL ATTRIBUTE SELECTION TIMES (s)")
iter = 0
for key, value in times.items():
if iter % 2 != 0:
plt.bar(key, value)
print(f"{iter}. {key}: {value}")
iter += 1
plt.title("Time to select attributes")
plt.xlabel("Model")
plt.ylabel("Time (s)")
plt.tight_layout()
plt.xticks(rotation=45, ha='right', size=7)
# Exporting image as png to ../data/img folder - easier to visualize the annotations, better resolution
plt.savefig("../data/img/basic_methods_time_atb.png")
plt.show()
MODEL SCORES (NMAE in evaluation) 0. KNN_pred: 3239984.25 1. KNN_pred_k: 2690780.4078947366 2. KNN_select: 2880131.5631625694 3. KNN_select_k: 2603870.865432223 4. RegTrees_pred: 3467149.4407894737 5. RegTrees_pred_k: 3328832.171052632 6. RegTrees_select: 2743220.575657895 7. RegTrees_select_k: 2727416.151315789 8. LinearReg_pred: 2437056.0592061607 9. LinearReg_pred_k: 2421796.652193799 10. LinearReg_select: 2396352.0117066414 11. LinearReg_select_k: 2389586.491181177 12. DummyReg: 6953359.144736841 13. DummyReg_k: 6953359.144736841
MODEL TIMES (s) 0. KNN_pred: 4.260819673538208 2. KNN_select: 6.565516233444214 4. RegTrees_pred: 0.5705435276031494 6. RegTrees_select: 16.35970973968506 8. LinearReg_pred: 0.2956578731536865 10. LinearReg_select: 4.527256488800049 12. DummyReg: 0.23421549797058105
MODEL ATTRIBUTE SELECTION TIMES (s) 1. KNN_pred_k: 8.521639347076416 3. KNN_select_k: 11.211513996124268 5. RegTrees_pred_k: 3.9039855003356934 7. RegTrees_select_k: 44.26800060272217 9. LinearReg_pred_k: 2.5743250846862793 11. LinearReg_select_k: 6.653132915496826 13. DummyReg_k: 1.9249184131622314
After computing all of the models, we can draw some conclusions:
¿Are the obtained results better than the ones from the naive approach?
Yes, they are, all of the models had a way better score than the one obtained by the naive approach, which had an -NMAE of 6953359.144736841 (with and without attribute selection). This can be clearly seen in the first graph showing the different models and their scores.
There, it can be observed that all of the models we tested outperformed the dummy model by a significant margin. The dummy model produced an NMAE error of -6953359.144736841, while our worst-performing model (RegTrees_pred) produced an error of -3467149.4407894737.
¿Are models with selection of parameters better than the ones that use the predefined ones?
We found that in general, selecting hyperparameters led to better results across all of the models. However, this improvement came at the cost of increased time and computing resources. Therefore, when deciding which model to use, it is important to consider the balance between improved performance and increased training time.
¿Are models with selection of attributes better than the ones that use all of them?
By observing the performace (both time and score), and the graphs, we can observe that the models with attribute selection generally perform better than those that use all attributes as they help reduce noise and irrelevant features, resulting in a more focused set of attributes for modeling, improving accuracy and generalization. However, as it can be observed in the third graph, attribute selection increases computation time as it adds a stage of preprocessing to the model. Despite the longer training times, the benefits of improved performance may outweigh the costs. As stated with selection of parameters question, it is needed to consider specific project requirements and resources when deciding on the approach to use.
Model selection
Regarding the individual models, we observed that the LinearReg_select_k model performed the best in terms of NMAE, while the RegTrees_pred model performed the worst, as it overfits perfectly the training data.
Based on these findings, we would recommend using the LinearReg_select_k model (-NMAE: 2389586.491181177; time: 4.264322519302368) if the client prioritizes accuracy over computing time. However, if computing time is a priority, we would recommend using the LinearReg_pred model (-NMAE: 2437056.0592061607; time: 0.23032045364379883), which only sacrifices about 47470 points of accuracy points (1.95%) while reducing computing time by more than 94.2%.
On the other hand, if we wanted a balance in between those two models, there are LinearReg_pred_k (-NMAE: 2421796.652193799; time: 2.2443368434906006) and LinearReg_select (-NMAE: 2396352.0117066414; time: 2.5291051864624023), which are a good balance between accuracy and computing time.
Ultimately, the decision of which model to use depends on the client's budget and objectives. For the final prediction, we recommend using the LinearReg_select_k (or LinearReg_select) model as it still provides a good balance between accuracy and computing time. Moreover, taking into account the dataset and the problem nature, it is plausible that the model training will be done at much yearly, so the time saved by using the LinearReg_pred instead of the LinearReg_pred_k model is not significant compared to the score gained.
It is possible to reduce the problem's dimensionality, as evidenced by the findings in the EDA section, where numerous attributes were identified to be highly correlated. By removing some of these attributes, we can effectively reduce the dimensionality of the problem. Therefore, it is recommended to utilize Principal Component Analysis (PCA) as a technique to reduce the dimensionality of the problem.
As highlighted in the EDA section, there are several attributes that exhibit strong interrelationships to the point of being redundant (with a correlation higher than 98%).
There are two different approaches to reduce the dimensionality of the problem:
The first one was removing by hand the attributes seen as redundant in the EDA section. This approach was not used as it would be a tedious and error-prone process, and it would not be scalable to other datasets.
The second approach, which is the one used in the industry, is to use a feature selection algorithm preprocessing in the pipeline of the models to automatically identify and remove redundant attributes.
This second approach of using pipelines with attribute selection was the one employed in our project. It was implemented using the "SelectKBest(f_regression)" feature selector, which considers only the linear relationship between the attributes and the output variable. As consecuence, using this feature selector leaves room to more possible optimisations and selections of correlated relationships of non-linear nature or interrelationships between the attributes, as seen in the EDA section. Therefore, there is still a space to improve the results by using a more advanced feature selection algorithm, such as the "Recursive Feature Elimination" algorithm (RFE).
# Create a dictionary with all the dataset variables
def get_variable_freq():
columns = disp_df.columns.tolist()
variables = {col: 0 for col in columns}
# Getting the selected attributes for each model
for model in models.keys():
# We only want to check the models that select attributes (take into account that dummy regressor selection is included(dswrf_s3_1))
if list(models.keys()).index(model) % 2 != 0:
# We get the selected attributes
selected_atb = models[model].best_estimator_.named_steps["select"].get_support()
# We get the names of the selected attributes
selected_atb_names = X_train.columns[selected_atb]
print(f"{model} selected {len(selected_atb_names)}")
# Make a frequency table of the selected attributes
selected_atb_names = pd.DataFrame(selected_atb_names)
selected_atb_names.columns = ["Attribute"]
selected_atb_names = selected_atb_names.groupby("Attribute").size().reset_index(name="Frequency")
selected_atb_names = selected_atb_names.sort_values(by="Frequency", ascending=False)
selected_atb_names = selected_atb_names.reset_index(drop=True)
# Append the results the dictionary
for atb in selected_atb_names["Attribute"]:
variables[atb] += 1
print(f"Attributes frequency: {variables}")
# plot all the attributes and their frequency
plt.rcParams['figure.figsize'] = [10, 3.5]
for key, value in variables.items():
plt.bar(key, value)
plt.title("Frequency of the selected attributes")
plt.xlabel("Attribute")
plt.ylabel("Frequency")
plt.tight_layout()
plt.xticks(rotation=45, ha='right', size=6.5)
plt.show()
get_variable_freq()
KNN_pred_k selected 6
KNN_select_k selected 6
RegTrees_pred_k selected 9
RegTrees_select_k selected 4
LinearReg_pred_k selected 72
LinearReg_select_k selected 72
DummyReg_k selected 1
Attributes frequency: {'apcp_sf1_1': 2, 'apcp_sf2_1': 2, 'apcp_sf3_1': 2, 'apcp_sf4_1': 2, 'apcp_sf5_1': 2, 'dlwrf_s1_1': 2, 'dlwrf_s2_1': 2, 'dlwrf_s3_1': 2, 'dlwrf_s4_1': 2, 'dlwrf_s5_1': 2, 'dswrf_s1_1': 2, 'dswrf_s2_1': 5, 'dswrf_s3_1': 7, 'dswrf_s4_1': 6, 'dswrf_s5_1': 6, 'pres_ms1_1': 0, 'pres_ms2_1': 0, 'pres_ms3_1': 0, 'pres_ms4_1': 2, 'pres_ms5_1': 2, 'pwat_ea1_1': 2, 'pwat_ea2_1': 2, 'pwat_ea3_1': 2, 'pwat_ea4_1': 2, 'pwat_ea5_1': 2, 'spfh_2m1_1': 2, 'spfh_2m2_1': 2, 'spfh_2m3_1': 2, 'spfh_2m4_1': 2, 'spfh_2m5_1': 2, 'tcdc_ea1_1': 2, 'tcdc_ea2_1': 2, 'tcdc_ea3_1': 2, 'tcdc_ea4_1': 2, 'tcdc_ea5_1': 2, 'tcolc_e1_1': 2, 'tcolc_e2_1': 2, 'tcolc_e3_1': 2, 'tcolc_e4_1': 2, 'tcolc_e5_1': 2, 'tmax_2m1_1': 2, 'tmax_2m2_1': 2, 'tmax_2m3_1': 2, 'tmax_2m4_1': 2, 'tmax_2m5_1': 2, 'tmin_2m1_1': 2, 'tmin_2m2_1': 2, 'tmin_2m3_1': 2, 'tmin_2m4_1': 2, 'tmin_2m5_1': 2, 'tmp_2m_1_1': 2, 'tmp_2m_2_1': 2, 'tmp_2m_3_1': 2, 'tmp_2m_4_1': 2, 'tmp_2m_5_1': 2, 'tmp_sfc1_1': 2, 'tmp_sfc2_1': 2, 'tmp_sfc3_1': 2, 'tmp_sfc4_1': 2, 'tmp_sfc5_1': 2, 'ulwrf_s1_1': 2, 'ulwrf_s2_1': 2, 'ulwrf_s3_1': 2, 'ulwrf_s4_1': 3, 'ulwrf_s5_1': 3, 'ulwrf_t1_1': 2, 'ulwrf_t2_1': 2, 'ulwrf_t3_1': 2, 'ulwrf_t4_1': 2, 'ulwrf_t5_1': 2, 'uswrf_s1_1': 2, 'uswrf_s2_1': 6, 'uswrf_s3_1': 5, 'uswrf_s4_1': 2, 'uswrf_s5_1': 3, 'salida': 0}
Upon analyzing the graph, we gain valuable insights into the attributes that are frequently selected by the feature selector like dswrf_s2_1, dswrf_s3_1, dswrf_s4_1, dswrf_s5_1, uswrf_s2_1, and uswrf_s3_1. It is evident that the attributes chosen are highly correlated with the target variable, aligning with our expectations. This reaffirms the efficacy of the feature selector in identifying relevant attributes correlated to the target variable.
Take into account that some of the mentioned attributes are also highly correlated with each other (as seen during EDA), but this is not a problem that our feature selector is able to identify.
Conversely, we can also infer that attributes that are scarcely selected by the feature selector, such as pres_ms1_1, pres_ms2_1, and pres_ms3_1, are not significant in the context of the problem. This is indicative that these attributes lack correlation with the target variable, and their inclusion in the model may introduce noise or irrelevant information. Hence, the feature selector's ability to filter out such attributes further strengthens its effectiveness in feature selection and highlights the importance of using it for improved model performance.
plt.rcParams['figure.figsize'] = [10, 3.5]
# Select the even times (the ones that are not selectors of attributes)
times_no_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 == 0}
# Select the odd times (the ones that are selectors of attributes)
times_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 != 0}
# Sum both dictionaries to get the total time of each model
for key in times_atb.keys():
times_atb[key] += times_no_atb[key.replace("_k", "")]
times_no_atb_arr = list(times_no_atb.values())
times_atb_arr = list(times_atb.values())
model_indices = np.arange(len(list(times_no_atb.keys())))
width = 0.35
fig, ax = plt.subplots()
rects1 = ax.bar(model_indices - width/2, times_no_atb_arr, width, label='No attribute selection')
rects2 = ax.bar(model_indices + width/2, times_atb_arr, width, label='Attribute selection')
ax.set_xlabel('Model')
ax.set_ylabel('Times')
ax.set_title('')
ax.set_xticks(model_indices)
ax.set_xticklabels(list(times_no_atb.keys()))
ax.legend()
plt.xticks(size=5.9)
plt.show()
iter = 0
for key, value in results.items():
plt.bar(key, abs(value[0]))
iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()
plt.xticks(rotation=30, ha='right', size=7)
plt.show()
As mentioned in the "Conclusions" section (Section 5.6), the performance of the models with attribute selection is generally better compared to using all attributes. This is because attribute selection helps reduce noise and irrelevant features, resulting in a more focused set of attributes for modeling, which can lead to improved accuracy and generalization, as evident from the performance metrics and graphs. However, it should be noted that attribute selection does add an additional preprocessing stage to the model, which can increase computation time. Despite the longer training times, the potential benefits of improved performance may outweigh the costs.
The performance of the models with and without attribute selection is clearly depicted in the two above graphs:
The first graph illustrates that the models with attribute selection require more computing time compared to those using all attributes. This is expected due to the additional preprocessing stage.
The second graph demonstrates how the models with attribute selection outperform those using all attributes, as they effectively reduce noise and irrelevant features, resulting in improved performance.
In order to be consistent, although we have already seen that usign the selection of attributes makes the model better, we will continue to use the two-step pipeline method we have been using in the basic models. This way, we can also verify that the results are better than the ones obtained with the basic methods (for both with and without attribute selection).
Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression analysis. SVM works by finding the hyperplane that best separates the data into different classes. The hyperplane is chosen such that it maximizes the margin between the closest data points from each class, known as support vectors. SVM can also use kernel functions to transform the input data into a higher dimensional space, allowing the separation of non-linearly separable data.
Note: in this dataset, the target variable conatains very large values, and the default value of C is 1.0 by default, which is too small for this dataset. This will make the SVM act as if it was a dummy regressor (as seen before), where it simply predicts the mean of the target variable for all data points, leading to poor model performance. This can be readily observed in section 8 of the notebook, where we compare the values and results of all the models, including the computation time and score. Notably, we find that the results of the Support Vector Machine (SVM) model with the default value of C are identical to those of the dummy regressor.
To overcome this issue, it is important to select an appropriate value for the C parameter that matches the characteristics of the dataset. By increasing the value of C to a more suitable value, the SVM becomes more flexible and capable of fitting the data better. This allows the SVM to capture the underlying patterns and relationships in the dataset more accurately, resulting in improved prediction performance.
from sklearn.svm import SVR
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[
('scaler', RobustScaler()),
("model", SVR())
]
)
param_grid = {
"model__kernel": ["rbf"],
"model__C": [1.0],
"model__gamma": ["scale"],
"model__epsilon": [0.1],
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["SVM_pred"] = model
results["SVM_pred"] = score
times["SVM_pred"] = total_time
print_results("SVM PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -6953343.12 RMSE train: 8058525.28 | MAE train: 6899170.65 RMSE validation train: 8120576.93 | MAE validation train: 6944009.69 RMSE validation test: 7809107.04 | MAE validation test: 6720917.54
---------------------------------------------------
SVM PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', RobustScaler()),
('model', SVR())]),
param_grid={'model__C': [1.0], 'model__epsilon': [0.1],
'model__gamma': ['scale'], 'model__kernel': ['rbf']},
scoring='neg_mean_absolute_error')
Parameters: {'model__C': 1.0, 'model__epsilon': 0.1, 'model__gamma': 'scale', 'model__kernel': 'rbf'}
Performance: NMAE (val): -6953343.117286754 | RMSE train: 8058525.276321875 | MAE train: 6899170.647284467 | RMSE train in validation: 8120576.9337218935 | MAE train in validation: 6944009.686073507 | RMSE test in validation: 7809107.037449846 | MAE test in validation: 6720917.536373118
Execution time: 2.1233932971954346s
As it was stated before, it can be clearly seen that with the default 1.0 value of C, the SVM acts as a dummy regressor.
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[
("scaler", RobustScaler()),
("select", SelectKBest(f_regression)),
("model", SVR()),
]
)
param_grid = {
"model__kernel": ["rbf"],
"model__C": [1.0],
"model__gamma": ["scale"],
"model__epsilon": [0.1],
"select__k": list(range(1, X_train.shape[1] + 1)),
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["SVM_pred_k"] = model
results["SVM_pred_k"] = score
times["SVM_pred_k"] = total_time
print_results("SVM PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -6952999.55 RMSE train: 8057851.47 | MAE train: 6898491.59 RMSE validation train: 8120039.58 | MAE validation train: 6943474.80 RMSE validation test: 7808606.98 | MAE validation test: 6720375.02
---------------------------------------------------
SVM PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', RobustScaler()),
('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model', SVR())]),
n_jobs=-1,
param_grid={'model__C': [1.0], 'model__epsilon': [0.1],
'model__gamma': ['scale'], 'model__kernel': ['rbf'],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'model__C': 1.0, 'model__epsilon': 0.1, 'model__gamma': 'scale', 'model__kernel': 'rbf', 'select__k': 1}
Performance: NMAE (val): -6952999.553407727 | RMSE train: 8057851.465218631 | MAE train: 6898491.591050504 | RMSE train in validation: 8120039.578355538 | MAE train in validation: 6943474.801851249 | RMSE test in validation: 7808606.977541712 | MAE test in validation: 6720375.017111049
Execution time: 29.39649200439453s
Building upon the previous definition, we can reduce the most important parameters to be adjusted to the following for SVM:
rmse = []
mae = []
rmse2 = []
mae2 = []
a_c = [1.0, 100, 10000, 100000, 1000000, 10000000, 100000000, 1000000000, 10000000000, 100000000000, 1000000000000]
a_kernel = ["linear", "rbf", "sigmoid", "poly"]
for i in a_c:
model = SVR(C=i)
model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
y_pred = model.predict(X_test_5th_fold_train)
rmse.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
mae.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
for i in a_kernel:
model = SVR(kernel=i)
model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
y_pred = model.predict(X_test_5th_fold_train)
rmse2.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
mae2.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
# Crear dos subplots, uno para cada gráfico
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(8, 12))
# Graficar RMSE vs. C en el primer subplot
ax1.plot(a_c, rmse, label="RMSE")
ax1.set_xlabel("C")
ax1.set_ylabel("RMSE")
ax1.set_title("Gráfica de RMSE")
# Graficar MAE vs. C en el segundo subplot
ax2.plot(a_c, mae, label="MAE")
ax2.set_xlabel("C")
ax2.set_ylabel("MAE")
ax2.set_title("Gráfica de MAE")
# Graficar RMSE vs. kernel en el tercer subplot
ax3.plot(a_kernel, rmse2, label="RMSE")
ax3.set_xlabel("kernel")
ax3.set_ylabel("RMSE")
ax3.set_title("Gráfica de RMSE")
# Graficar MAE vs. metric en el cuarto subplot
ax4.plot(a_kernel, mae2, label="MAE")
ax4.set_xlabel("kernel")
ax4.set_ylabel("MAE")
ax4.set_title("Gráfica de MAE")
plt.tight_layout()
plt.rcParams['figure.figsize'] = [10, 3]
plt.show()
Note: we needed to add more parameters to C in order to have a budget of 75 so its computing time is comparable to the other models.
np.random.seed(10)
budget = 75
n_splits = 5
pipeline = Pipeline(
[
("scaler", StandardScaler()),
# We scale the data to avoid overfitting - Recommended for SVMs
("model", SVR())
# Support Vector Regression (SVR for regression, SVC for classification)
]
)
# We need to reduce the C parameter number to reduce the computational time -> tends to infinity
param_grid = {
"model__kernel": ["linear", "rbf", "sigmoid", "poly"], # poly is too slow and not near good as linear
"model__C": [500000, 5000000, 7000000, 750000, 7750000, 800000, 8500000, 1000000, 5000000, 10000000],
"model__gamma": ["scale", "auto"],
}
# We use TimeSeriesSplit to split the data in folds without losing the temporal order
model = RandomizedSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_iter=budget,
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["SVM_select"] = model
results["SVM_select"] = score
times["SVM_select"] = total_time
print_results("SVM SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2331297.20 RMSE train: 3390722.81 | MAE train: 2254336.08 RMSE validation train: 3402804.08 | MAE validation train: 2244918.83 RMSE validation test: 3486393.92 | MAE validation test: 2328968.77
---------------------------------------------------
SVM SELECTED PARAMETERS best model is:
RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', StandardScaler()),
('model', SVR())]),
n_iter=75, n_jobs=-1,
param_distributions={'model__C': [500000, 5000000, 7000000,
750000, 7750000, 800000,
8500000, 1000000, 5000000,
10000000],
'model__gamma': ['scale', 'auto'],
'model__kernel': ['linear', 'rbf',
'sigmoid', 'poly']},
scoring='neg_mean_absolute_error')
Parameters: {'model__kernel': 'linear', 'model__gamma': 'auto', 'model__C': 1000000}
Performance: NMAE (val): -2331297.199428374 | RMSE train: 3390722.8061495544 | MAE train: 2254336.0822791597 | RMSE train in validation: 3402804.0781806447 | MAE train in validation: 2244918.82901025 | RMSE test in validation: 3486393.9201029483 | MAE test in validation: 2328968.7736492744
Execution time: 165.52876663208008s
np.random.seed(10)
n_splits = 5
# We use Ridge as model as it is the best performing one
pipeline = Pipeline(
[
("scaler", StandardScaler()),
("select", SelectKBest(f_regression)),
("model", SVR())
]
)
# Previous model Parameters: {'model__kernel': 'linear', 'model__gamma': 'auto', 'model__C': 1000000}
param_grid = {
"model__kernel": ["linear"],
"model__C": [1000000],
"model__gamma": ["auto"],
"select__k": list(range(1, X_train.shape[1] + 1)),
}
# We use TimeSeriesSplit to split the data in folds without losing the temporal order
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["SVM_select_k"] = model
results["SVM_select_k"] = score
times["SVM_select_k"] = total_time
print_results("SVM SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2331773.95 RMSE train: 3384381.68 | MAE train: 2251590.34 RMSE validation train: 3452046.46 | MAE validation train: 2272265.76 RMSE validation test: 3570029.36 | MAE validation test: 2374500.73
---------------------------------------------------
SVM SELECTED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('scaler', StandardScaler()),
('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model', SVR())]),
n_jobs=-1,
param_grid={'model__C': [1000000], 'model__gamma': ['auto'],
'model__kernel': ['linear'],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'model__C': 1000000, 'model__gamma': 'auto', 'model__kernel': 'linear', 'select__k': 61}
Performance: NMAE (val): -2331773.9543751357 | RMSE train: 3384381.6752997166 | MAE train: 2251590.3404441564 | RMSE train in validation: 3452046.4594078064 | MAE train in validation: 2272265.7630818374 | RMSE test in validation: 3570029.3628540644 | MAE test in validation: 2374500.7294935877
Execution time: 38.443318367004395s
Random forest is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class or mean prediction of the individual trees. Random forests improve on the decision tree model by reducing overfitting and increasing accuracy. This is achieved by generating multiple decision trees and then aggregating their predictions through a voting system.
from sklearn.ensemble import RandomForestRegressor
np.random.seed(10)
n_splits = 5
pipeline = Pipeline([("model", RandomForestRegressor(random_state=10))])
param_grid = {
"model__n_estimators": [100],
"model__criterion": ["squared_error"],
"model__max_depth": [None],
"model__min_samples_split": [2],
"model__max_features": [None],
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["RandForest_pred"] = model
results["RandForest_pred"] = score
times["RandForest_pred"] = total_time
print_results("RANDOM FOREST PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2453026.62 RMSE train: 1230647.08 | MAE train: 859275.53 RMSE validation train: 1247101.73 | MAE validation train: 871871.43 RMSE validation test: 3316103.20 | MAE validation test: 2268131.29
---------------------------------------------------
RANDOM FOREST PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('model',
RandomForestRegressor(random_state=10))]),
n_jobs=-1,
param_grid={'model__criterion': ['squared_error'],
'model__max_depth': [None],
'model__max_features': [None],
'model__min_samples_split': [2],
'model__n_estimators': [100]},
scoring='neg_mean_absolute_error')
Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'model__n_estimators': 100}
Performance: NMAE (val): -2453026.6184210526 | RMSE train: 1230647.077689153 | MAE train: 859275.5293150685 | RMSE train in validation: 1247101.733618154 | MAE train in validation: 871871.4328767123 | RMSE test in validation: 3316103.1974173784 | MAE test in validation: 2268131.293150685
Execution time: 22.279118299484253s
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[("select", SelectKBest(f_regression)), ("model", RandomForestRegressor(random_state=10))]
)
param_grid = {
"model__n_estimators": [100],
"model__criterion": ["squared_error"],
"model__max_depth": [None],
"model__min_samples_split": [2],
"model__max_features": [None],
"select__k": list(range(1, X_train.shape[1] + 1)),
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["RandForest_pred_k"] = model
results["RandForest_pred_k"] = score
times["RandForest_pred_k"] = total_time
print_results("RANDOM FOREST PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2453026.62 RMSE train: 1230647.08 | MAE train: 859275.53 RMSE validation train: 1246492.73 | MAE validation train: 872079.40 RMSE validation test: 3310640.67 | MAE validation test: 2264352.50
---------------------------------------------------
RANDOM FOREST PREDEFINED PARAMETERS best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model',
RandomForestRegressor(random_state=10))]),
n_jobs=-1,
param_grid={'model__criterion': ['squared_error'],
'model__max_depth': [None],
'model__max_features': [None],
'model__min_samples_split': [2],
'model__n_estimators': [100],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'model__n_estimators': 100, 'select__k': 72}
Performance: NMAE (val): -2453026.6184210526 | RMSE train: 1230647.077689153 | MAE train: 859275.5293150685 | RMSE train in validation: 1246492.7307536777 | MAE train in validation: 872079.3976027397 | RMSE test in validation: 3310640.668170457 | MAE test in validation: 2264352.497260274
Execution time: 144.7716188430786s
Building upon the previous definition, we can reduce the most important parameters to be adjusted to the following:
rmse = []
mae = []
rmse2 = []
mae2 = []
rmse3 = []
mae3 =[]
a_n_stimators = [10, 30, 50, 70, 100, 130, 170, 200]
a_max_depth = range(5, 36, 5)
a_min_samples_split = range(5, 200, 15)
for i in a_n_stimators:
model = RandomForestRegressor(random_state=10, n_estimators=i)
model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
y_pred = model.predict(X_test_5th_fold_train)
rmse3.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
mae3.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
for i in a_max_depth:
model = RandomForestRegressor(random_state=10, max_depth=i)
model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
y_pred = model.predict(X_test_5th_fold_train)
rmse.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
mae.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
for i in a_min_samples_split:
model = RandomForestRegressor(random_state=10, min_samples_split=i)
model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
y_pred = model.predict(X_test_5th_fold_train)
rmse2.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
mae2.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
# Crear dos subplots, uno para cada gráfico
fig, (ax1, ax2, ax3, ax4, ax5, ax6) = plt.subplots(6, 1, figsize=(8, 12))
# Graficar RMSE vs. n_neighbors en el primer subplot
ax1.plot(list(a_max_depth), rmse, label="RMSE")
ax1.set_xlabel("max_depth")
ax1.set_ylabel("RMSE")
ax1.set_title("Gráfica de RMSE")
# Graficar MAE vs. n_neighbors en el segundo subplot
ax2.plot(list(a_max_depth), mae, label="MAE")
ax2.set_xlabel("max_depth")
ax2.set_ylabel("MAE")
ax2.set_title("Gráfica de MAE")
# Graficar RMSE vs. metric en el tercer subplot
ax3.plot(list(a_min_samples_split), rmse2, label="RMSE")
ax3.set_xlabel("min_samples_split")
ax3.set_ylabel("RMSE")
ax3.set_title("Gráfica de RMSE")
# Graficar MAE vs. metric en el cuarto subplot
ax4.plot(list(a_min_samples_split), mae2, label="MAE")
ax4.set_xlabel("min_samples_split")
ax4.set_ylabel("MAE")
ax4.set_title("Gráfica de MAE")
# Graficar RMSE vs. metric en el tercer subplot
ax5.plot(a_n_stimators, rmse3, label="RMSE")
ax5.set_xlabel("n_estimators")
ax5.set_ylabel("RMSE")
# ax5.set_title("Gráfica de RMSE")
# Graficar MAE vs. metric en el cuarto subplot
ax6.plot(a_n_stimators, mae3, label="MAE")
ax6.set_xlabel("n_estimators")
ax6.set_ylabel("MAE")
ax6.set_title("Gráfica de MAE")
plt.tight_layout()
plt.rcParams['figure.figsize'] = [10, 3]
plt.show()
np.random.seed(10)
budget = 75
n_splits = 5
pipeline = Pipeline(
[
("model", RandomForestRegressor(random_state=10))
]
)
param_grid = {
"model__n_estimators": [100, 300, 350, 400, 450], # 500, 600, 700, 900, 10000 -> too slow for the minimal improvements they offer in the scoring (not even perceptible) - 450 still makes a decent improvement
"model__max_depth": list(range(5, 36, 5)),
"model__min_samples_split": [2, 3, 4, 5],
"model__max_features": ["sqrt"], # log2 does not offer as good results
}
model = RandomizedSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_iter=budget,
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["RandForest_select"] = model
results["RandForest_select"] = score
times["RandForest_select"] = total_time
print_results("Random Forest", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2323073.36 RMSE train: 1248296.12 | MAE train: 871530.08 RMSE validation train: 1215594.37 | MAE validation train: 850946.23 RMSE validation test: 3230903.24 | MAE validation test: 2197047.20
---------------------------------------------------
Random Forest best model is:
RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('model',
RandomForestRegressor(random_state=10))]),
n_iter=75, n_jobs=-1,
param_distributions={'model__max_depth': [5, 10, 15, 20, 25,
30, 35],
'model__max_features': ['sqrt'],
'model__min_samples_split': [2, 3, 4,
5],
'model__n_estimators': [100, 300, 350,
400, 450]},
scoring='neg_mean_absolute_error')
Parameters: {'model__n_estimators': 450, 'model__min_samples_split': 2, 'model__max_features': 'sqrt', 'model__max_depth': 25}
Performance: NMAE (val): -2323073.358721178 | RMSE train: 1248296.1226726803 | MAE train: 871530.0753454532 | RMSE train in validation: 1215594.3653978498 | MAE train in validation: 850946.2297440937 | RMSE test in validation: 3230903.2390529006 | MAE test in validation: 2197047.195952723
Execution time: 124.77998352050781s
np.random.seed(10)
n_splits = 5
pipeline = Pipeline(
[("select", SelectKBest(f_regression)), ("model", RandomForestRegressor(random_state=10))]
)
param_grid = {
"model__n_estimators": [450],
"model__max_depth": [25],
"model__min_samples_split": [2],
"model__max_features": ["sqrt"],
"select__k": list(range(1, X_train.shape[1] + 1)),
}
model = GridSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_jobs=-1,
)
start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["RandForest_select_k"] = model
results["RandForest_select_k"] = score
times["RandForest_select_k"] = total_time
print_results("Random Forest", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2322506.37 RMSE train: 1187813.56 | MAE train: 831397.07 RMSE validation train: 1216037.12 | MAE validation train: 853005.44 RMSE validation test: 3225078.82 | MAE validation test: 2191218.75
---------------------------------------------------
Random Forest best model is:
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model',
RandomForestRegressor(random_state=10))]),
n_jobs=-1,
param_grid={'model__max_depth': [25],
'model__max_features': ['sqrt'],
'model__min_samples_split': [2],
'model__n_estimators': [450],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'model__max_depth': 25, 'model__max_features': 'sqrt', 'model__min_samples_split': 2, 'model__n_estimators': 450, 'select__k': 69}
Performance: NMAE (val): -2322506.367627253 | RMSE train: 1187813.5582928217 | MAE train: 831397.0666560882 | RMSE train in validation: 1216037.1201242164 | MAE train in validation: 853005.4435113268 | RMSE test in validation: 3225078.8210638203 | MAE test in validation: 2191218.7493181694
Execution time: 147.9776096343994s
# Plotting the most used attributes
get_variable_freq()
KNN_pred_k selected 6
KNN_select_k selected 6
RegTrees_pred_k selected 9
RegTrees_select_k selected 4
LinearReg_pred_k selected 72
LinearReg_select_k selected 72
DummyReg_k selected 1
SVM_pred_k selected 1
SVM_select_k selected 61
RandForest_pred_k selected 72
RandForest_select_k selected 69
Attributes frequency: {'apcp_sf1_1': 3, 'apcp_sf2_1': 4, 'apcp_sf3_1': 5, 'apcp_sf4_1': 4, 'apcp_sf5_1': 5, 'dlwrf_s1_1': 4, 'dlwrf_s2_1': 4, 'dlwrf_s3_1': 5, 'dlwrf_s4_1': 5, 'dlwrf_s5_1': 5, 'dswrf_s1_1': 4, 'dswrf_s2_1': 8, 'dswrf_s3_1': 11, 'dswrf_s4_1': 9, 'dswrf_s5_1': 9, 'pres_ms1_1': 0, 'pres_ms2_1': 0, 'pres_ms3_1': 0, 'pres_ms4_1': 3, 'pres_ms5_1': 4, 'pwat_ea1_1': 3, 'pwat_ea2_1': 4, 'pwat_ea3_1': 4, 'pwat_ea4_1': 5, 'pwat_ea5_1': 5, 'spfh_2m1_1': 5, 'spfh_2m2_1': 5, 'spfh_2m3_1': 5, 'spfh_2m4_1': 5, 'spfh_2m5_1': 5, 'tcdc_ea1_1': 5, 'tcdc_ea2_1': 5, 'tcdc_ea3_1': 5, 'tcdc_ea4_1': 5, 'tcdc_ea5_1': 5, 'tcolc_e1_1': 5, 'tcolc_e2_1': 5, 'tcolc_e3_1': 5, 'tcolc_e4_1': 5, 'tcolc_e5_1': 5, 'tmax_2m1_1': 5, 'tmax_2m2_1': 5, 'tmax_2m3_1': 5, 'tmax_2m4_1': 5, 'tmax_2m5_1': 5, 'tmin_2m1_1': 5, 'tmin_2m2_1': 5, 'tmin_2m3_1': 5, 'tmin_2m4_1': 5, 'tmin_2m5_1': 5, 'tmp_2m_1_1': 5, 'tmp_2m_2_1': 5, 'tmp_2m_3_1': 5, 'tmp_2m_4_1': 5, 'tmp_2m_5_1': 5, 'tmp_sfc1_1': 5, 'tmp_sfc2_1': 5, 'tmp_sfc3_1': 5, 'tmp_sfc4_1': 5, 'tmp_sfc5_1': 5, 'ulwrf_s1_1': 5, 'ulwrf_s2_1': 5, 'ulwrf_s3_1': 5, 'ulwrf_s4_1': 6, 'ulwrf_s5_1': 6, 'ulwrf_t1_1': 5, 'ulwrf_t2_1': 5, 'ulwrf_t3_1': 5, 'ulwrf_t4_1': 5, 'ulwrf_t5_1': 5, 'uswrf_s1_1': 5, 'uswrf_s2_1': 9, 'uswrf_s3_1': 8, 'uswrf_s4_1': 5, 'uswrf_s5_1': 6, 'salida': 0}
# Getting the importance of each attribute in the models
print("Random forest feature importance")
feature_importance_arr = []
for model in models:
# Only for Random Forest
if model.startswith("RandForest"):
# Get the feature importances and attribute names
feature_importances = models[model].best_estimator_.named_steps["model"].feature_importances_
#attribute_names = models[model].best_estimator_.named_steps["preprocessor"].transformers_[0][2]
# Print the feature importance + the name of the attribute
for feature_importance in feature_importances:
feature_importance_arr.append(feature_importance)
print(f"{feature_importance_arr}")
Random forest feature importance [0.006823467236950922, 0.005289307300387328, 0.006267694296187506, 0.004022317151944208, 0.005404432357240497, 0.0022519491070036647, 0.0018361790164572482, 0.002247769952480248, 0.0014650044914462713, 0.0013006480224745894, 0.0004960851418488201, 0.0055382107495284315, 0.315742659652574, 0.3517793255489918, 0.04184928580091645, 0.002458978401449125, 0.002053964997419152, 0.0015936968889132862, 0.0020610986322658993, 0.0025472233010468913, 0.002879771282194894, 0.0024998467946694917, 0.0021275033383356613, 0.0022459040220456344, 0.003496953824292926, 0.0026076753823704698, 0.0017930838183311767, 0.0015719768382412133, 0.0027201445198449213, 0.003708652654260024, 0.003897506027980119, 0.003167200122163599, 0.0028718893936009737, 0.0018100864005531929, 0.0025377048235108606, 0.007044176167206148, 0.0062632988315099196, 0.004404341213717501, 0.003908374369991919, 0.00395205992060554, 0.0019721308557179005, 0.0011637037207955414, 0.000800379853040628, 0.000838520500870624, 0.0013666777131945717, 0.0009608105869797138, 0.0009190921579659422, 0.0008173443796027224, 0.001021960754386707, 0.00151280327981057, 0.0007639976942684739, 0.0009346498206002092, 0.0009534210634611878, 0.0012829642911412935, 0.002465317576758651, 0.001165180991507426, 0.0013971482408855862, 0.00199209205831166, 0.0024706120760282723, 0.00360350807458225, 0.0013436229867836167, 0.0021948655823592015, 0.0011227743340925592, 0.0018105761751751005, 0.0015890353220823447, 0.0033750976600800415, 0.00364699503427307, 0.003577394363633643, 0.003268232876439697, 0.0037954585711020995, 0.0003503972143710564, 0.008752891824028953, 0.04809102633776446, 0.05056290288042291, 0.009578965354528917, 0.0068403869301967395, 0.005373279303395462, 0.006372854124595899, 0.004111600062359494, 0.005382524543626419, 0.002376707941881877, 0.0018655300328057575, 0.0023187503945616595, 0.0014623332529318355, 0.0014143686169222132, 0.0004998132798549249, 0.00558797526039632, 0.3159371643993869, 0.351988610155103, 0.04212065117411924, 0.0034656893816911337, 0.0032270325690724857, 0.002935420343051062, 0.0023937024548202125, 0.0023119591202634343, 0.0022941716323672937, 0.003526358900124105, 0.0027365092286796883, 0.001765582954544091, 0.0017221784359597186, 0.002820601209027081, 0.003650330653143147, 0.003992074051631715, 0.0031344067030880975, 0.003074613747433749, 0.0018998704090882916, 0.0025954938809593956, 0.006990140912730289, 0.00644897734514765, 0.004473559756813118, 0.0037406250236200053, 0.00402454314681774, 0.002006163875451341, 0.0012158224737252418, 0.0007043584662778803, 0.0009642697822988204, 0.0012018319153018597, 0.0011034909135244964, 0.0009054555635728764, 0.0008879402231901818, 0.001085559920023129, 0.001955913969888609, 0.0007624518602052361, 0.0008781706836049682, 0.001029512529331001, 0.001349096954243529, 0.002501339848028483, 0.0011391828221795318, 0.001399662440200732, 0.0019101702579749115, 0.0024695480574327133, 0.003393099668721604, 0.0013818307482691425, 0.0021432896705809437, 0.0012325333968976205, 0.0016715611251833181, 0.0018575620441252682, 0.0037406654663174246, 0.0035201938928982276, 0.003608244753719502, 0.0034049511389126594, 0.0037936346289837663, 0.00032082449417820773, 0.008803707887331204, 0.0482002525909784, 0.05068142859196787, 0.009899886012268103, 0.003724035703835566, 0.0062359361179643216, 0.009240913210498341, 0.0058127312331301495, 0.005372061245997567, 0.0029465278723924157, 0.0033012693536005255, 0.003251694102810587, 0.0023560055460886955, 0.0024355632035187942, 0.0004236419769430372, 0.061599560073731824, 0.09121017194569941, 0.09190147878427775, 0.0899781182151589, 0.0023106723244598697, 0.002380129580585116, 0.002309647463192926, 0.0022045185874568505, 0.0023375751909906714, 0.0037045066333714696, 0.004394961453288986, 0.0036257550464911734, 0.0032218892535575885, 0.0031306358653928083, 0.002381892753595732, 0.002360840897174175, 0.0023876405322844378, 0.0029395366380763174, 0.003070244226964902, 0.007223606447359112, 0.011819211834030799, 0.008633007896986646, 0.0060502797575006685, 0.00399533509753541, 0.006818744483782783, 0.013415431657438318, 0.009817634506956403, 0.006095255590229829, 0.00460012059346887, 0.0021554413534535907, 0.0027284219254231665, 0.005250853161822686, 0.007624268986852327, 0.005098788125669833, 0.0018391598461120668, 0.001986793219779255, 0.0021815200639112757, 0.004969594575969665, 0.012872757247774012, 0.001951277869362719, 0.003982290976427233, 0.00603765245448148, 0.009867923533034227, 0.01298464716438702, 0.0019704910959919956, 0.00414342000609132, 0.012112937225630735, 0.02291973606720845, 0.010725025342562434, 0.0020494403497396535, 0.0032765366573998707, 0.003928727252962354, 0.013274934271127805, 0.02041052055492802, 0.002736095709877521, 0.006332732491709332, 0.008790365434515857, 0.011744338408147098, 0.017296899204761195, 0.0006843381698444326, 0.07671220695457114, 0.06854948839276186, 0.06464427316059666, 0.06514731984729216, 0.007733787879633859, 0.009203910580481157, 0.005883398098927908, 0.005586300117648668, 0.0033886070005346137, 0.003250393730715769, 0.003252941408309692, 0.002544006630893072, 0.0026915716395631848, 0.0005035528915993646, 0.05849207968349354, 0.088025909872651, 0.10228702484621725, 0.09221107319459551, 0.0027695957918516074, 0.004172588624545819, 0.004010671551562766, 0.0035812002224717434, 0.003512597770595825, 0.0026632339724947077, 0.0026167481464412046, 0.002547337600033131, 0.002973510335774411, 0.0035224074896548067, 0.00663833109510814, 0.011783420132970795, 0.008558065898171529, 0.005307442994300936, 0.003791095630677494, 0.008802753095846605, 0.014173866773218455, 0.009471583900944443, 0.006477952777471706, 0.0047612177007301335, 0.002430657290167516, 0.00329565654911136, 0.003905075888666834, 0.006302551285052551, 0.0062497040501772495, 0.0019487786598519168, 0.0020139206995887305, 0.00190188114483014, 0.008527372394746047, 0.012121269286707871, 0.002066044377839217, 0.0040608826634516235, 0.0040999243541863725, 0.004696990742636009, 0.009840383869773702, 0.0020716293517295164, 0.0037579750458387614, 0.011898670876887427, 0.021356473011829087, 0.011658375964503014, 0.002269671455828889, 0.0023767923179314177, 0.004373268570312354, 0.014836026245567995, 0.0182371460817263, 0.003090051935164686, 0.004645871298289147, 0.012897685394097504, 0.017111736203689196, 0.01860208498866022, 0.0004009093072549272, 0.07327670150901809, 0.08316695799290197, 0.059525515045730476, 0.0637951850661211]
# Get the 5 most important attributes
# Sort feature importances in descending order
if model.startswith("RandForest"):
print(model)
# Get the feature importances and attribute names
feature_importances = models[model].best_estimator_.named_steps["model"].feature_importances_
importances_descending = sorted(zip(feature_importances, disp_df.columns), reverse=True)
# Print top n attributes and their importances
n_top_attributes = 5
for importance, attribute_name in importances_descending[:n_top_attributes]:
print(f"{attribute_name}: {importance}")
RandForest_select_k dswrf_s3_1: 0.10228702484621725 dswrf_s4_1: 0.09221107319459551 dswrf_s2_1: 0.088025909872651 ulwrf_t2_1: 0.08316695799290197 ulwrf_t1_1: 0.07327670150901809
First of all, it must be understanded that the easiest way of knowing the most relevant attributes is by using the trees, since this model uses the attributes in ranked relevance to split data in each level. As we can see in the list provided, the most relevant attributes are the following:
The importance and meaning of this attributes can be found in the EDA section.
On the other hand, when checking the frequency of our selected attributes from our scoring function, we need to take into account that some of them are highly correlated with each other, but as it was stated before, our scoring function cant take into account this type of correlations, so it is important to keep in mind that the most relevant attributes are the ones that are highly correlated with the target variable.
That is why the most relevant attributes may be the ones that the Random Forest model has selected, since it is a tree-based model, and it is able to take into account the correlations between the attributes. But at the end all depends on quality of the data, the size of the dataset, the specific problem being solved, and the quality of the model.
# Select the even times (the ones that are not selectors of attributes)
times_no_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 == 0}
# Select the odd times (the ones that are selectors of attributes)
times_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 != 0}
# Sum both dictionaries to get the total time of each model
for key in times_atb.keys():
times_atb[key] += times_no_atb[key.replace("_k", "")]
times_no_atb_arr = list(times_no_atb.values())
times_atb_arr = list(times_atb.values())
# Solo los que empiezan por SVM o RandForest
model_indices = np.arange(len(list(times_no_atb.keys())))
width = 0.35
fig, ax = plt.subplots()
rects1 = ax.bar(model_indices - width/2, times_no_atb_arr, width, label='No attribute selection')
rects2 = ax.bar(model_indices + width/2, times_atb_arr, width, label='Attribute selection')
ax.set_xlabel('Model')
ax.set_ylabel('Times')
ax.set_title('')
ax.set_xticks(model_indices)
ax.set_xticklabels(list(times_no_atb.keys()))
ax.legend()
plt.xticks(size=5.9)
plt.show()
iter = 0
for key, value in results.items():
plt.bar(key, abs(value[0]))
iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()
plt.xticks(rotation=30, ha='right', size=7)
plt.show()
iter = 0
for key, value in results.items():
plt.bar(key, abs(value[6]))
iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("MAE in test validation")
plt.tight_layout()
plt.xticks(rotation=30, ha='right', size=7)
plt.show()
The process of selecting the best parameters and attributes for advanced models is a crucial step in machine learning. In this study, the selection of attributes was performed using the SelectKBest method, which helps to eliminate attributes that have a low correlation with the output, resulting in better performance for the models. The selection of parameters was also a critical factor in improving the model's accuracy. In particular, the models with parameter selection achieved better scores than the models without parameter selection.
However, it's important to note that this improvement in performance comes with a trade-off: the increase in score implies a notable increase in time and computational cost. Therefore, when selecting the best models, it's essential to consider not only their performance but also their computational complexity.
After testing several models, the three best-performing models were identified as Random Forests, SVMs, and Linear Regression, with or without attribute and parameter selection. These models showed the highest scores in the experiments, and one of them will be selected as the final model in section 8.1.1, where we will compare the results of both SVM and Random Forests closely in order to make a wise decision.
Overall, the results of this study demonstrate the importance of selecting the right parameters and attributes when building advanced machine learning models. By doing so, we can achieve better accuracy and performance, leading to more effective and efficient machine learning applications.
We will re-visit all the models and select the best one, which we have stated to be the one with the lowest MAE and the lowest RMSE.
# ! Print the models best parameters
i = 0
for key, value in models.items():
print(f"\n\n{i}. Sected model: {key}\n")
print(f"Parameters: {value.best_params_}")
print(
f"\nPerformance:\n",
f"NMAE (val): {results[key][0]}\n",
f"RMSE train: {results[key][1]} | ",
f"MAE train: {results[key][2]}\n",
f"RMSE train in validation: {results[key][3]} | ",
f"MAE train in validation: {results[key][4]}\n",
f"RMSE test in validation: {results[key][5]} | ",
f"MAE test in validation: {results[key][6]}",
sep="",
)
print(f"Time: {times[key]} s")
i+=1
plt.rcParams['figure.figsize'] = [10, 3.5]
# ! Plot (NMAE)
print("MODEL SCORES (NMAE in evaluation)")
iter = 0
for key, value in results.items():
plt.bar(key, abs(value[0]))
print(f"{iter}. {key}: {abs(value[0])}")
iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()
plt.xticks(rotation=45, ha='right', size=7)
# Exporting image as png to ../data/img folder
plt.savefig("../data/img/advanced_methods_score.png")
plt.show()
# ! Plot (MAE train in validation)
iter = 0
for key, value in results.items():
plt.bar(key, abs(value[4]))
print(f"{iter}. {key}: {abs(value[4])}")
iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("MAE train in validation")
plt.tight_layout()
plt.xticks(rotation=45, ha='right', size=7)
# Exporting image as png to ../data/img folder
plt.savefig("../data/img/basic_methods_score.png")
plt.show()
# ! Plot (MAE test in validation)
iter = 0
for key, value in results.items():
plt.bar(key, abs(value[6]))
print(f"{iter}. {key}: {abs(value[6])}")
iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("MAE test in validation")
plt.tight_layout()
plt.xticks(rotation=45, ha='right', size=7)
# Exporting image as png to ../data/img folder
plt.savefig("../data/img/basic_methods_score.png")
plt.show()
# ! Plot the time
iter = 0
for key, value in times.items():
plt.bar(key, value)
print(f"{iter}. {key}: {value}")
iter += 1
plt.title("Time")
plt.xlabel("Model")
plt.ylabel("Time (s)")
plt.tight_layout()
plt.xticks(rotation=45, ha='right', size=7)
# ! Plot the accumulated approximated real times
print("Accumulated approximated real times")
# Select the even times (the ones that are not selectors of attributes)
times_no_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 == 0}
# Select the odd times (the ones that are selectors of attributes)
times_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 != 0}
# Sum both dictionaries to get the total time of each model
for key in times_atb.keys():
times_atb[key] += times_no_atb[key.replace("_k", "")]
times_no_atb_arr = list(times_no_atb.values())
times_atb_arr = list(times_atb.values())
model_indices = np.arange(len(list(times_no_atb.keys())))
width = 0.35
fig, ax = plt.subplots()
rects1 = ax.bar(model_indices - width/2, times_no_atb_arr, width, label='No attribute selection')
rects2 = ax.bar(model_indices + width/2, times_atb_arr, width, label='Attribute selection')
ax.set_xlabel('Model')
ax.set_ylabel('Times')
ax.set_title('')
ax.set_xticks(model_indices)
ax.set_xticklabels(list(times_no_atb.keys()))
ax.legend()
plt.xticks(size=5.9)
plt.show()
# Exporting image as png to ../data/img folder - easier to visualize the annotations, better resolution
plt.savefig("../data/img/advanced_methods_time.png")
plt.show()
0. Sected model: KNN_pred
Parameters: {'model__algorithm': 'auto', 'model__metric': 'minkowski', 'model__n_neighbors': 5, 'model__weights': 'uniform'}
Performance:
NMAE (val): -3239984.25
RMSE train: 3517654.379918169 | MAE train: 2493007.2164383563
RMSE train in validation: 3557480.484807456 | MAE train in validation: 2518007.157534247
RMSE test in validation: 4152140.058048495 | MAE test in validation: 2892257.01369863
Time: 4.260819673538208 s
1. Sected model: KNN_pred_k
Parameters: {'model__algorithm': 'auto', 'model__metric': 'minkowski', 'model__n_neighbors': 5, 'model__weights': 'uniform', 'select__k': 6}
Performance:
NMAE (val): -2690780.4078947366
RMSE train: 3108869.5243311627 | MAE train: 2162755.2657534247
RMSE train in validation: 3116226.5417679944 | MAE train in validation: 2171515.705479452
RMSE test in validation: 3775814.085873258 | MAE test in validation: 2560118.5479452056
Time: 8.521639347076416 s
2. Sected model: KNN_select
Parameters: {'model__weights': 'distance', 'model__n_neighbors': 17, 'model__metric': 'manhattan', 'model__algorithm': 'kd_tree'}
Performance:
NMAE (val): -2880131.5631625694
RMSE train: 0.0 | MAE train: 0.0
RMSE train in validation: 0.0 | MAE train in validation: 0.0
RMSE test in validation: 3732609.9812009404 | MAE test in validation: 2587777.1287017944
Time: 6.565516233444214 s
3. Sected model: KNN_select_k
Parameters: {'model__algorithm': 'kd_tree', 'model__metric': 'manhattan', 'model__n_neighbors': 9, 'model__weights': 'distance', 'select__k': 6}
Performance:
NMAE (val): -2603870.865432223
RMSE train: 1355.336484531192 | MAE train: 31.726027397260275
RMSE train in validation: 25827.044900407618 | MAE train in validation: 675.9246575342465
RMSE test in validation: 3681057.75211333 | MAE test in validation: 2483096.382277287
Time: 11.211513996124268 s
4. Sected model: RegTrees_pred
Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2}
Performance:
NMAE (val): -3467149.4407894737
RMSE train: 0.0 | MAE train: 0.0
RMSE train in validation: 0.0 | MAE train in validation: 0.0
RMSE test in validation: 4961507.791413844 | MAE test in validation: 3406755.205479452
Time: 0.5705435276031494 s
5. Sected model: RegTrees_pred_k
Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'select__k': 9}
Performance:
NMAE (val): -3328832.171052632
RMSE train: 0.0 | MAE train: 0.0
RMSE train in validation: 0.0 | MAE train in validation: 0.0
RMSE test in validation: 5002502.819275869 | MAE test in validation: 3460965.616438356
Time: 3.9039855003356934 s
6. Sected model: RegTrees_select
Parameters: {'model__min_samples_split': 106, 'model__max_features': None, 'model__max_depth': 30, 'model__criterion': 'absolute_error'}
Performance:
NMAE (val): -2743220.575657895
RMSE train: 3259190.446254432 | MAE train: 2080612.602739726
RMSE train in validation: 3286556.310045412 | MAE train in validation: 2092567.191780822
RMSE test in validation: 3914582.5939823505 | MAE test in validation: 2655352.602739726
Time: 16.35970973968506 s
7. Sected model: RegTrees_select_k
Parameters: {'model__criterion': 'absolute_error', 'model__max_depth': 30, 'model__max_features': None, 'model__min_samples_split': 106, 'select__k': 4}
Performance:
NMAE (val): -2727416.151315789
RMSE train: 3452866.617242818 | MAE train: 2199234.328767123
RMSE train in validation: 3561457.960699349 | MAE train in validation: 2280089.2808219176
RMSE test in validation: 4044668.035092536 | MAE test in validation: 2710957.602739726
Time: 44.26800060272217 s
8. Sected model: LinearReg_pred
Parameters: {'model__fit_intercept': True}
Performance:
NMAE (val): -2437056.0592061607
RMSE train: 3254352.603690468 | MAE train: 2321647.0597032406
RMSE train in validation: 3265297.879240584 | MAE train in validation: 2322380.6106294743
RMSE test in validation: 3268115.4760430153 | MAE test in validation: 2265683.802964292
Time: 0.2956578731536865 s
9. Sected model: LinearReg_pred_k
Parameters: {'model__fit_intercept': True, 'select__k': 72}
Performance:
NMAE (val): -2421796.652193799
RMSE train: 3256573.9989301027 | MAE train: 2323171.6092511206
RMSE train in validation: 3267629.5529683903 | MAE train in validation: 2322601.753096195
RMSE test in validation: 3267567.87998712 | MAE test in validation: 2263068.4012916926
Time: 2.5743250846862793 s
10. Sected model: LinearReg_select
Parameters: {'model__alpha': 1.0642092440647246}
Performance:
NMAE (val): -2396352.0117066414
RMSE train: 3276534.918917554 | MAE train: 2333075.6766354376
RMSE train in validation: 3292824.947383585 | MAE train in validation: 2337932.2861476443
RMSE test in validation: 3280253.2990303193 | MAE test in validation: 2260087.8287112545
Time: 4.527256488800049 s
11. Sected model: LinearReg_select_k
Parameters: {'model__alpha': 0.9693631061142517, 'select__k': 72}
Performance:
NMAE (val): -2389586.491181177
RMSE train: 3278341.466529396 | MAE train: 2333541.305110323
RMSE train in validation: 3293274.5203141714 | MAE train in validation: 2336194.5845998474
RMSE test in validation: 3278610.608896576 | MAE test in validation: 2258218.4050652594
Time: 6.653132915496826 s
12. Sected model: DummyReg
Parameters: {'model__strategy': 'median'}
Performance:
NMAE (val): -6953359.144736841
RMSE train: 8058570.051086258 | MAE train: 6899205.369863014
RMSE train in validation: 8120616.171716434 | MAE train in validation: 6944040.205479452
RMSE test in validation: 7809144.902737563 | MAE test in validation: 6720947.2602739725
Time: 0.23421549797058105 s
13. Sected model: DummyReg_k
Parameters: {'model__strategy': 'median', 'select__k': 1}
Performance:
NMAE (val): -6953359.144736841
RMSE train: 8058570.051086258 | MAE train: 6899205.369863014
RMSE train in validation: 8120616.171716434 | MAE train in validation: 6944040.205479452
RMSE test in validation: 7809144.902737563 | MAE test in validation: 6720947.2602739725
Time: 1.9249184131622314 s
14. Sected model: SVM_pred
Parameters: {'model__C': 1.0, 'model__epsilon': 0.1, 'model__gamma': 'scale', 'model__kernel': 'rbf'}
Performance:
NMAE (val): -6953343.117286754
RMSE train: 8058525.276321875 | MAE train: 6899170.647284467
RMSE train in validation: 8120576.9337218935 | MAE train in validation: 6944009.686073507
RMSE test in validation: 7809107.037449846 | MAE test in validation: 6720917.536373118
Time: 2.1233932971954346 s
15. Sected model: SVM_pred_k
Parameters: {'model__C': 1.0, 'model__epsilon': 0.1, 'model__gamma': 'scale', 'model__kernel': 'rbf', 'select__k': 1}
Performance:
NMAE (val): -6952999.553407727
RMSE train: 8057851.465218631 | MAE train: 6898491.591050504
RMSE train in validation: 8120039.578355538 | MAE train in validation: 6943474.801851249
RMSE test in validation: 7808606.977541712 | MAE test in validation: 6720375.017111049
Time: 29.39649200439453 s
16. Sected model: SVM_select
Parameters: {'model__kernel': 'linear', 'model__gamma': 'auto', 'model__C': 1000000}
Performance:
NMAE (val): -2331297.199428374
RMSE train: 3390722.8061495544 | MAE train: 2254336.0822791597
RMSE train in validation: 3402804.0781806447 | MAE train in validation: 2244918.82901025
RMSE test in validation: 3486393.9201029483 | MAE test in validation: 2328968.7736492744
Time: 165.52876663208008 s
17. Sected model: SVM_select_k
Parameters: {'model__C': 1000000, 'model__gamma': 'auto', 'model__kernel': 'linear', 'select__k': 61}
Performance:
NMAE (val): -2331773.9543751357
RMSE train: 3384381.6752997166 | MAE train: 2251590.3404441564
RMSE train in validation: 3452046.4594078064 | MAE train in validation: 2272265.7630818374
RMSE test in validation: 3570029.3628540644 | MAE test in validation: 2374500.7294935877
Time: 38.443318367004395 s
18. Sected model: RandForest_pred
Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'model__n_estimators': 100}
Performance:
NMAE (val): -2453026.6184210526
RMSE train: 1230647.077689153 | MAE train: 859275.5293150685
RMSE train in validation: 1247101.733618154 | MAE train in validation: 871871.4328767123
RMSE test in validation: 3316103.1974173784 | MAE test in validation: 2268131.293150685
Time: 22.279118299484253 s
19. Sected model: RandForest_pred_k
Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'model__n_estimators': 100, 'select__k': 72}
Performance:
NMAE (val): -2453026.6184210526
RMSE train: 1230647.077689153 | MAE train: 859275.5293150685
RMSE train in validation: 1246492.7307536777 | MAE train in validation: 872079.3976027397
RMSE test in validation: 3310640.668170457 | MAE test in validation: 2264352.497260274
Time: 144.7716188430786 s
20. Sected model: RandForest_select
Parameters: {'model__n_estimators': 450, 'model__min_samples_split': 2, 'model__max_features': 'sqrt', 'model__max_depth': 25}
Performance:
NMAE (val): -2323073.358721178
RMSE train: 1248296.1226726803 | MAE train: 871530.0753454532
RMSE train in validation: 1215594.3653978498 | MAE train in validation: 850946.2297440937
RMSE test in validation: 3230903.2390529006 | MAE test in validation: 2197047.195952723
Time: 124.77998352050781 s
21. Sected model: RandForest_select_k
Parameters: {'model__max_depth': 25, 'model__max_features': 'sqrt', 'model__min_samples_split': 2, 'model__n_estimators': 450, 'select__k': 69}
Performance:
NMAE (val): -2322506.367627253
RMSE train: 1187813.5582928217 | MAE train: 831397.0666560882
RMSE train in validation: 1216037.1201242164 | MAE train in validation: 853005.4435113268
RMSE test in validation: 3225078.8210638203 | MAE test in validation: 2191218.7493181694
Time: 147.9776096343994 s
MODEL SCORES (NMAE in evaluation)
0. KNN_pred: 3239984.25
1. KNN_pred_k: 2690780.4078947366
2. KNN_select: 2880131.5631625694
3. KNN_select_k: 2603870.865432223
4. RegTrees_pred: 3467149.4407894737
5. RegTrees_pred_k: 3328832.171052632
6. RegTrees_select: 2743220.575657895
7. RegTrees_select_k: 2727416.151315789
8. LinearReg_pred: 2437056.0592061607
9. LinearReg_pred_k: 2421796.652193799
10. LinearReg_select: 2396352.0117066414
11. LinearReg_select_k: 2389586.491181177
12. DummyReg: 6953359.144736841
13. DummyReg_k: 6953359.144736841
14. SVM_pred: 6953343.117286754
15. SVM_pred_k: 6952999.553407727
16. SVM_select: 2331297.199428374
17. SVM_select_k: 2331773.9543751357
18. RandForest_pred: 2453026.6184210526
19. RandForest_pred_k: 2453026.6184210526
20. RandForest_select: 2323073.358721178
21. RandForest_select_k: 2322506.367627253
0. KNN_pred: 2518007.157534247 1. KNN_pred_k: 2171515.705479452 2. KNN_select: 0.0 3. KNN_select_k: 675.9246575342465 4. RegTrees_pred: 0.0 5. RegTrees_pred_k: 0.0 6. RegTrees_select: 2092567.191780822 7. RegTrees_select_k: 2280089.2808219176 8. LinearReg_pred: 2322380.6106294743 9. LinearReg_pred_k: 2322601.753096195 10. LinearReg_select: 2337932.2861476443 11. LinearReg_select_k: 2336194.5845998474 12. DummyReg: 6944040.205479452 13. DummyReg_k: 6944040.205479452 14. SVM_pred: 6944009.686073507 15. SVM_pred_k: 6943474.801851249 16. SVM_select: 2244918.82901025 17. SVM_select_k: 2272265.7630818374 18. RandForest_pred: 871871.4328767123 19. RandForest_pred_k: 872079.3976027397 20. RandForest_select: 850946.2297440937 21. RandForest_select_k: 853005.4435113268
0. KNN_pred: 2892257.01369863 1. KNN_pred_k: 2560118.5479452056 2. KNN_select: 2587777.1287017944 3. KNN_select_k: 2483096.382277287 4. RegTrees_pred: 3406755.205479452 5. RegTrees_pred_k: 3460965.616438356 6. RegTrees_select: 2655352.602739726 7. RegTrees_select_k: 2710957.602739726 8. LinearReg_pred: 2265683.802964292 9. LinearReg_pred_k: 2263068.4012916926 10. LinearReg_select: 2260087.8287112545 11. LinearReg_select_k: 2258218.4050652594 12. DummyReg: 6720947.2602739725 13. DummyReg_k: 6720947.2602739725 14. SVM_pred: 6720917.536373118 15. SVM_pred_k: 6720375.017111049 16. SVM_select: 2328968.7736492744 17. SVM_select_k: 2374500.7294935877 18. RandForest_pred: 2268131.293150685 19. RandForest_pred_k: 2264352.497260274 20. RandForest_select: 2197047.195952723 21. RandForest_select_k: 2191218.7493181694
0. KNN_pred: 4.260819673538208 1. KNN_pred_k: 8.521639347076416 2. KNN_select: 6.565516233444214 3. KNN_select_k: 11.211513996124268 4. RegTrees_pred: 0.5705435276031494 5. RegTrees_pred_k: 3.9039855003356934 6. RegTrees_select: 16.35970973968506 7. RegTrees_select_k: 44.26800060272217 8. LinearReg_pred: 0.2956578731536865 9. LinearReg_pred_k: 2.5743250846862793 10. LinearReg_select: 4.527256488800049 11. LinearReg_select_k: 6.653132915496826 12. DummyReg: 0.23421549797058105 13. DummyReg_k: 1.9249184131622314 14. SVM_pred: 2.1233932971954346 15. SVM_pred_k: 29.39649200439453 16. SVM_select: 165.52876663208008 17. SVM_select_k: 38.443318367004395 18. RandForest_pred: 22.279118299484253 19. RandForest_pred_k: 144.7716188430786 20. RandForest_select: 124.77998352050781 21. RandForest_select_k: 147.9776096343994 Accumulated approximated real times
<Figure size 1000x350 with 0 Axes>
As it will be discussed later in section 8.1.1, although the SVM has a sligthly better -NMAE scoring, when in validation test, the best model by far is the Random Forests with selection of attributes and selection of parameters.
Timewise, as we argued before, is not relevant for us since the training is probably a one-time process, and the prediction is the one that is going to be used in the real world for a long time period. If not, as we will see later, we would choose the Random Forests with selection of attributes which offers a better performance in terms of MAE and RMSE.
Ultimately, if the time is a critical value for the client, we will choose the Linear Regression (already discussed the different models in section 5.5), as it is blazingly fast.
After carefully evaluating the results and considering various factors, we have come to the conclusion that Random Forest is the optimal choice for our model. Apart from outperforming SVM in terms of MAE and RMSE measurements in the 5th fold test-validation, Random Forest also offers several advantages.
One key advantage is its ability to handle non-linear relationships in the data. Random Forest employs a decision tree ensemble approach, which allows for capturing complex interactions and patterns in the data, making it a suitable choice for our dataset that may contain non-linear relationships between variables.
Additionally, Random Forest is known for its robustness to outliers and noise in the data. It is less sensitive to noisy data points compared to SVM, which can be especially beneficial when dealing with real-world datasets that often contain noise or outliers.
Furthermore, Random Forest is a highly scalable algorithm that can efficiently handle large datasets, making it suitable for our computational capabilities. On the other hand, SVM can be computationally intensive, especially with larger datasets and higher values of C, which may not be feasible in our current computational setup.
Although SVM has the potential to improve with an increase in C, we have weighed the trade-off between computation time and scoring results, and determined that Random Forest provides a favorable balance for our specific needs. A similar trade-off happens with Random Forest and the number of trees (estimators), but not that drastically.
The problem is that both C for SVM and estimators for Random Forest make the model better infinitely, but the computational cost is not linear, so the minimal (almost negligible) gain in performance is not worth the computational cost.
After considering both the -NMAE in validation and the test results in the fifth fold of validation, we have decided to use Random Forest as our preferred model. This is because the validation test allows us to assess how promisingly the model will perform in the actual test. Although SVM has slightly better performance in terms of -NMAE scoring, the marginal gain is not significant enough to outweigh its much worse performance (although good) in the validation test compared to Random Forest.
In conclusion, considering its superior performance in our validation tests, ability to handle non-linear relationships, robustness to noise, scalability, and computational efficiency, we have decided to select Random Forest as our preferred model for this particular project.
# The selected model is RandForest_select_k
sel_model = models["RandForest_select_k"]
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
True,
X_test,
y_test,
)
Results of the best estimator of Pipeline NMAE in validation: -2489180.03 RMSE train: 2071531.63 | MAE train: 1285734.74 RMSE test: 3071048.88 | MAE test: 2111167.90 RMSE validation train: 1216037.12 | MAE validation train: 853005.44 RMSE validation test: 3225078.82 | MAE validation test: 2191218.75
As we hypothesized before, the test scores are better than the validation scores (both scoring timeseries validation and test validation), which is a good sign and expected as they were an understimation. Overall, the score is very good, with a result of: RMSE test: 3071048.88 | MAE test: 2111167.90
Once selected the best model, we will train in with all the data we have available, and then we will use it to predict the values of the competition dataset.
First, we divide the whole dataset into the training set (inputs, X, and outputs, y). Then, we train the model with the whole dataset, and to be predicted, the model should perform better than the one we selected first, as it has more training data.
X_train = disp_df.drop("salida", axis=1) # This is the input features for training
y_train = disp_df["salida"] # This is the target variable for training
print("Data shape: ", disp_df.shape)
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
Data shape: (4380, 76) X_train shape: (4380, 75) y_train shape: (4380,)
# We will use the whole dataset to train the model - disp_df
np.random.seed(10)
budget = 100
n_splits = 5
pipeline = Pipeline(
[
("select", SelectKBest(f_regression)),
("model", RandomForestRegressor(random_state=10))
]
)
param_grid = {
"model__n_estimators": [100, 300, 350, 400, 450], # 500, 600, 700, 900, 10000 -> too slow for the minimal improvements they offer in the scoring (not even perceptible) - 450 still makes a decent improvement
"model__max_depth": list(range(5, 36, 5)),
"model__min_samples_split": [2, 3, 4, 5],
"model__max_features": ["sqrt"], # log2 does not offer as good results
"select__k": list(range(1, X_train.shape[1])),
}
model = RandomizedSearchCV(
pipeline,
param_grid,
scoring="neg_mean_absolute_error",
cv=TimeSeriesSplit(n_splits),
n_iter=budget,
n_jobs=-1,
)
start_time = time.time()
model.fit(X_train, y_train)
end_time = time.time()
total_time = end_time - start_time
# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train) # We al ready did the 5th fold split at the begginning
# We obtain the different scores of the model
score = train_validation_test(
model,
model.best_estimator_,
model.best_score_,
X_train,
y_train,
)
models["final_model"] = model
results["final_model"] = score
times["final_model"] = total_time
print_results("Random Forest (Final model)", model, score, total_time)
Results of the best estimator of Pipeline NMAE in validation: -2324597.85 RMSE train: 1249422.71 | MAE train: 868956.42 RMSE validation train: 1280013.14 | MAE validation train: 889329.11 RMSE validation test: 3218699.57 | MAE validation test: 2189210.81
---------------------------------------------------
Random Forest (Final model) best model is:
RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model',
RandomForestRegressor(random_state=10))]),
n_iter=100, n_jobs=-1,
param_distributions={'model__max_depth': [5, 10, 15, 20, 25,
30, 35],
'model__max_features': ['sqrt'],
'model__min_samples_split': [2, 3, 4,
5],
'model__n_estimators': [100, 300, 350,
400, 450],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27,
28, 29, 30, ...]},
scoring='neg_mean_absolute_error')
Parameters: {'select__k': 70, 'model__n_estimators': 300, 'model__min_samples_split': 3, 'model__max_features': 'sqrt', 'model__max_depth': 30}
Performance: NMAE (val): -2324597.848848228 | RMSE train: 1249422.7076020176 | MAE train: 868956.4214434926 | RMSE train in validation: 1280013.139932164 | MAE train in validation: 889329.1075172715 | RMSE test in validation: 3218699.573567223 | MAE test in validation: 2189210.8078122223
Execution time: 117.42434859275818s
To be observed, just as before, the scoring -NMAE is not as good as the test evaluation partition (which is a good indicator of the performance of the model with the competition dataset), but it is still a good indicator of the performance of the model: NMAE in validation: -2324597.85 | RMSE validation test: 3218699.57 | MAE validation test: 2189210.81
Note how the results are similar (better) to the ones predicted in the validation fold scoring, validation test, and test previously.
The bad thing about using the whole dataset for training is that we don't have any data left for testing the model's performance. Without a separate set of data for testing, we cannot accurately evaluate how well the model generalizes to unseen data.
To address this issue, we have implemented a function that calculates the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) of the model on the fifth fold of the train-validation splits. This allows us to obtain an estimate of the model's performance on the most trained fold, which can serve as an indication of how well the model is likely to perform in the near future.
plt.rcParams['figure.figsize'] = [10, 3.5]
print("MODEL SCORES (NMAE in evaluation)")
iter = 0
for key, value in results.items():
plt.bar(key, abs(value[0]))
print(f"{iter}. {key}: {abs(value[0])}")
iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()
plt.xticks(rotation=45, ha='right', size=7)
# Exporting image as png to ../data/img folder
plt.savefig("../data/img/best_methods_score.png")
plt.show()
print("MODEL SCORES (NMAE in evaluation)")
iter = 0
for key, value in results.items():
plt.bar(key, abs(value[6]))
print(f"{iter}. {key}: {abs(value[6])}")
iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()
plt.xticks(rotation=45, ha='right', size=7)
# Exporting image as png to ../data/img folder
plt.savefig("../data/img/best_methods_score.png")
plt.show()
MODEL SCORES (NMAE in evaluation) 0. KNN_pred: 3239984.25 1. KNN_pred_k: 2690780.4078947366 2. KNN_select: 2880131.5631625694 3. KNN_select_k: 2603870.865432223 4. RegTrees_pred: 3467149.4407894737 5. RegTrees_pred_k: 3328832.171052632 6. RegTrees_select: 2743220.575657895 7. RegTrees_select_k: 2727416.151315789 8. LinearReg_pred: 2437056.0592061607 9. LinearReg_pred_k: 2421796.652193799 10. LinearReg_select: 2396352.0117066414 11. LinearReg_select_k: 2389586.491181177 12. DummyReg: 6953359.144736841 13. DummyReg_k: 6953359.144736841 14. SVM_pred: 6953343.117286754 15. SVM_pred_k: 6952999.553407727 16. SVM_select: 2331297.199428374 17. SVM_select_k: 2331773.9543751357 18. RandForest_pred: 2453026.6184210526 19. RandForest_pred_k: 2453026.6184210526 20. RandForest_select: 2323073.358721178 21. RandForest_select_k: 2322506.367627253 22. final_model: 2324597.848848228
MODEL SCORES (NMAE in evaluation) 0. KNN_pred: 2892257.01369863 1. KNN_pred_k: 2560118.5479452056 2. KNN_select: 2587777.1287017944 3. KNN_select_k: 2483096.382277287 4. RegTrees_pred: 3406755.205479452 5. RegTrees_pred_k: 3460965.616438356 6. RegTrees_select: 2655352.602739726 7. RegTrees_select_k: 2710957.602739726 8. LinearReg_pred: 2265683.802964292 9. LinearReg_pred_k: 2263068.4012916926 10. LinearReg_select: 2260087.8287112545 11. LinearReg_select_k: 2258218.4050652594 12. DummyReg: 6720947.2602739725 13. DummyReg_k: 6720947.2602739725 14. SVM_pred: 6720917.536373118 15. SVM_pred_k: 6720375.017111049 16. SVM_select: 2328968.7736492744 17. SVM_select_k: 2374500.7294935877 18. RandForest_pred: 2268131.293150685 19. RandForest_pred_k: 2264352.497260274 20. RandForest_select: 2197047.195952723 21. RandForest_select_k: 2191218.7493181694 22. final_model: 2189210.8078122223
As mentioned before, the results regarding scoring of the final model are the best overall.
import pickle
print(models["final_model"].best_params_)
selected_model = models["final_model"]
print(f"\nSelected model: {selected_model}, {type(selected_model)}, {selected_model.best_params_}")
# Export model as pickle file in ../data/model folder
with open("../data/model/modelo_final.pkl", "wb") as file:
pickle.dump(selected_model, file)
# ! Compare the model exported with the one loaded - check if it is the same
# Load model from pickle file
with open("../data/model/modelo_final.pkl", "rb") as file:
loaded_model = pickle.load(file)
print(f"\nSaved model: {loaded_model}, {type(loaded_model)}, {loaded_model.best_params_}")
if selected_model.best_params_ == loaded_model.best_params_:
print("\n\nThe models has been saved and loaded correctly")
else:
print("\n\nERROR: The models are different")
{'select__k': 70, 'model__n_estimators': 300, 'model__min_samples_split': 3, 'model__max_features': 'sqrt', 'model__max_depth': 30}
Selected model: RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model',
RandomForestRegressor(random_state=10))]),
n_iter=100, n_jobs=-1,
param_distributions={'model__max_depth': [5, 10, 15, 20, 25,
30, 35],
'model__max_features': ['sqrt'],
'model__min_samples_split': [2, 3, 4,
5],
'model__n_estimators': [100, 300, 350,
400, 450],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27,
28, 29, 30, ...]},
scoring='neg_mean_absolute_error'), <class 'sklearn.model_selection._search.RandomizedSearchCV'>, {'select__k': 70, 'model__n_estimators': 300, 'model__min_samples_split': 3, 'model__max_features': 'sqrt', 'model__max_depth': 30}
Saved model: RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
estimator=Pipeline(steps=[('select',
SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
('model',
RandomForestRegressor(random_state=10))]),
n_iter=100, n_jobs=-1,
param_distributions={'model__max_depth': [5, 10, 15, 20, 25,
30, 35],
'model__max_features': ['sqrt'],
'model__min_samples_split': [2, 3, 4,
5],
'model__n_estimators': [100, 300, 350,
400, 450],
'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27,
28, 29, 30, ...]},
scoring='neg_mean_absolute_error'), <class 'sklearn.model_selection._search.RandomizedSearchCV'>, {'select__k': 70, 'model__n_estimators': 300, 'model__min_samples_split': 3, 'model__max_features': 'sqrt', 'model__max_depth': 30}
The models has been saved and loaded correctly
During this project, we have had the opportunity to gain a deeper understanding of the model selection process. We began with exploratory data analysis (EDA), which helped us to improve our understanding and management of the data. We found this to be an extremely useful tool throughout the entire project. We believe that this part of the project should be evaluated with greater emphasis, as it is the foundation upon which all of our decisions were based.
Next, we created and trained all of our models, gaining experience in the use of pipelines and a deeper understanding of the importance of hyperparameters. Finally, we analyzed the different results provided by each model, gaining a better understanding of their respective advantages and disadvantages in terms of scoring and time.
We believe that this project is an excellent complement to the main lessons, as it provides a deeper understanding of the subject matter.
import os
# Export the notebook to HTML
os.system("jupyter nbconvert --to html model.ipynb --output ../data/html/model.html")
print("Notebook exported to HTML")
[NbConvertApp] Converting notebook model.ipynb to html
Notebook exported to HTML
[NbConvertApp] Writing 16124216 bytes to ../data/html/model.html